An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4

📅 2024-03-05
📈 Citations: 21
Influential: 1
🤖 AI Summary
This study examines whether fine-tuned open-source judge models (built on bases such as Llama and Qwen) can serve as reliable general-purpose evaluators of large language models (LLMs) in place of GPT-4. Method: the authors construct a diverse, cross-task, out-of-distribution test suite and propose a multidimensional evaluation framework assessing generalization, fairness, and task adaptability; the evaluation integrates human calibration and consistency analysis to quantify judgment bias. Contribution/Results: fine-tuned judge models function fundamentally as task-specific classifiers and lack true general-purpose evaluation capability. On out-of-distribution tasks their average Kendall's τ lags GPT-4 by 12.6%, their fairness errors are 2.3× higher, and their accuracy drops by over 30% when switching tasks. The work provides the first empirical refutation of the "GPT-4 substitutability" hypothesis, establishing benchmarks and clarifying the reliability and applicability boundaries of automated LLM evaluation.
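The summary above compares judges by their rank correlation with human preferences, reported as Kendall's τ. A minimal sketch of that statistic (the τ-a variant, which assumes no tied scores; the example data is illustrative, not taken from the paper):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:          # pair ordered the same way by both raters
            concordant += 1
        elif s < 0:        # pair ordered oppositely
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for 5 responses: human ratings vs. a judge model's ratings
human = [5, 3, 4, 1, 2]
judge = [4, 3, 5, 2, 1]
print(round(kendall_tau(human, judge), 2))  # → 0.6
```

A judge that perfectly reproduces the human ranking scores τ = 1.0, a reversed ranking scores −1.0; the paper's 12.6% gap is measured on this scale. Production evaluations typically use a tie-aware variant (τ-b), since judge models often assign equal scores.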

📝 Abstract
Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies have fine-tuned judge models based on open-source LLMs for evaluation. While the fine-tuned judge models are claimed to achieve evaluation capability comparable with GPT-4, in this work we conduct an empirical study of LLM-as-a-Judge. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, and adaptability. We also reveal that the fine-tuned judge model inherently operates as a task-specific classifier, which consequently imposes these limitations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating fine-tuned judge models vs GPT-4 for LLM assessment
Assessing generalizability, fairness, adaptability of fine-tuned judge models
Revealing limitations of fine-tuned models as task-specific classifiers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned judge models for LLM evaluation
Comparative study with GPT-4 performance
Task-specific classifier limitations revealed
Hui Huang
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Yingqi Qu
Baidu Inc., Beijing, China
Jing Liu
Baidu Inc., Beijing, China
Muyun Yang
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Tiejun Zhao
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Hongli Zhou
Bing Xu