🤖 AI Summary
This study rigorously examines whether open-source fine-tuned discriminative judge models, built on bases such as Llama and Qwen, can serve as reliable general-purpose evaluators of large language models (LLMs) in place of GPT-4.
Method: We construct a diverse, cross-task and out-of-distribution test suite, and propose a multidimensional evaluation framework assessing generalization, fairness, and task adaptability. Evaluation integrates human calibration and consistency analysis to quantify judgment bias.
Contribution/Results: We demonstrate that fine-tuned discriminative models function fundamentally as task-specialized classifiers, lacking true general-purpose evaluation capability. On out-of-distribution tasks, their average Kendall’s τ lags behind GPT-4 by 12.6%; fairness errors are 2.3× higher; and accuracy drops by over 30% during task switching. This work provides the first empirical refutation of the “GPT-4 substitutability” hypothesis, establishing critical benchmarks and theoretical clarity on the reliability and applicability boundaries of automated LLM evaluation.
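The Kendall's τ figure cited above measures pairwise rank agreement between a judge model's scores and human scores. As a minimal sketch (the score lists below are invented for illustration and are not data from the study), the tau-a variant can be computed as:

```python
def kendall_tau(a, b):
    """Kendall's tau-a: pairwise rank agreement between two score lists."""
    assert len(a) == len(b)
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in both lists
            elif s < 0:
                discordant += 1   # pair ordered oppositely
            # ties (s == 0) count toward neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical example: human ranking vs. a judge model's ranking of 5 responses
human = [1, 2, 3, 4, 5]
judge = [1, 3, 2, 4, 5]   # judge swaps two adjacent responses
print(kendall_tau(human, judge))  # → 0.8 (9 concordant, 1 discordant of 10 pairs)
```

A τ of 1.0 means identical rankings and −1.0 means fully reversed rankings, so a 12.6% gap in average τ reflects substantially weaker agreement with human judgments on out-of-distribution tasks.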
📝 Abstract
Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs, and many studies have fine-tuned judge models based on open-source LLMs for this purpose. While these fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, in this work we conduct an empirical study of LLM-as-a-Judge. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, and adaptability. We further reveal that a fine-tuned judge model inherently operates as a task-specific classifier, which accounts for these limitations.