🤖 AI Summary
This work addresses evaluation bias in LLM-as-a-judge frameworks that rely solely on greedy-decoded outputs, proposing instead to derive judgments from the full probability distribution over judgment tokens. Methodologically: (1) it shows that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e., greedy decoding) across evaluation settings; (2) it explores novel methods of deriving preferences from judgment distributions, finding that methods incorporating risk aversion often improve performance; and (3) it demonstrates empirically that chain-of-thought (CoT) prompting can collapse the spread of the judgment distribution, often harming performance. Experiments span pointwise, pairwise, and listwise evaluation paradigms. The findings suggest that leveraging the distributional output of LLM judges, rather than the text interface alone, enables more fine-grained and reliable automated text assessment.
📝 Abstract
Using language models to scalably approximate human preferences on text quality (LLM-as-a-judge) has become a standard practice applicable to many tasks. A judgment is often extracted from the judge's textual output alone, typically with greedy decoding. However, LLM judges naturally provide distributions over judgment tokens, inviting a breadth of inference methods for extracting fine-grained preferences. We find that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e. greedy decoding) in all evaluation settings (i.e. pointwise, pairwise, and listwise). We further explore novel methods of deriving preferences from judgment distributions, and find that methods incorporating risk aversion often improve performance. Lastly, we analyze LLM-as-a-judge paired with chain-of-thought (CoT) prompting, showing that CoT can collapse the spread of the judgment distribution, often harming performance. Our findings suggest leveraging distributional output can improve LLM-as-a-judge, as opposed to using the text interface alone.
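The mean-versus-mode distinction above can be made concrete with a small sketch. Assuming a pointwise judge that rates text on a 1–5 scale, and given (hypothetical, illustrative) log-probabilities over the candidate score tokens, the mode corresponds to greedy decoding while the mean is a probability-weighted expected score; a simple risk-averse variant (one possible instantiation, not necessarily the paper's exact method) penalizes high-variance judgments:

```python
import math

# Hypothetical log-probabilities an LLM judge assigns to the score tokens
# "1".."5" for a single pointwise rating (illustrative values only).
logprobs = {"1": -4.2, "2": -2.1, "3": -0.9, "4": -0.8, "5": -2.5}

# Normalize into a proper probability distribution over the score tokens.
unnorm = {tok: math.exp(lp) for tok, lp in logprobs.items()}
total = sum(unnorm.values())
probs = {tok: p / total for tok, p in unnorm.items()}

# Mode = what greedy decoding returns: the single most likely score token.
mode_score = int(max(probs, key=probs.get))

# Mean = probability-weighted expected score: a fine-grained judgment that
# uses the whole distribution, not just its peak.
mean_score = sum(int(tok) * p for tok, p in probs.items())

# Risk-averse variant (illustrative): subtract a fraction of the standard
# deviation, so uncertain judgments are discounted toward lower scores.
variance = sum(p * (int(tok) - mean_score) ** 2 for tok, p in probs.items())
risk_averse_score = mean_score - 0.5 * math.sqrt(variance)
```

With these example values the mode is 4 while the mean lands near 3.4, showing how greedy decoding can discard the information that substantial probability mass sits on the neighboring score.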