🤖 AI Summary
This study rigorously examines whether open-source fine-tuned discriminative judge models, built on bases such as Llama and Qwen, can serve as reliable general-purpose evaluators of large language models (LLMs) in place of GPT-4.
Method: We construct a diverse, cross-task and out-of-distribution test suite, and propose a multidimensional evaluation framework assessing generalization, fairness, and task adaptability. Evaluation integrates human calibration and consistency analysis to quantify judgment bias.
Contribution/Results: We demonstrate that fine-tuned discriminative models function fundamentally as task-specialized classifiers, lacking true general-purpose evaluation capability. On out-of-distribution tasks, their average Kendall’s τ lags behind GPT-4 by 12.6%; fairness errors are 2.3× higher; and accuracy drops by over 30% during task switching. This work provides the first empirical refutation of the “GPT-4 substitutability” hypothesis, establishing critical benchmarks and theoretical clarity on the reliability and applicability boundaries of automated LLM evaluation.
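The Kendall's τ figure cited above measures pairwise rank agreement between a judge model's scores and human scores. As a minimal sketch (the score lists below are invented for illustration and are not data from the study), the tau-a variant can be computed as:

```python
def kendall_tau(a, b):
    """Kendall's tau-a: pairwise rank agreement between two score lists."""
    assert len(a) == len(b)
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in both lists
            elif s < 0:
                discordant += 1   # pair ordered oppositely
            # ties (s == 0) count toward neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical example: human ranking vs. a judge model's ranking of 5 responses
human = [1, 2, 3, 4, 5]
judge = [1, 3, 2, 4, 5]   # judge swaps two adjacent responses
print(kendall_tau(human, judge))  # → 0.8 (9 concordant, 1 discordant of 10 pairs)
```

A τ of 1.0 means identical rankings and −1.0 means fully reversed rankings, so a 12.6% gap in average τ reflects substantially weaker agreement with human judgments on out-of-distribution tasks.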
📝 Abstract
Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs, and many studies have fine-tuned judge models based on open-source LLMs for this purpose. While these fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, in this work we conduct an empirical study of LLM-as-a-Judge. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, and adaptability. We further reveal that a fine-tuned judge model inherently operates as a task-specific classifier, which accounts for these limitations.