🤖 AI Summary
This work addresses evaluation bias in LLM-as-a-judge frameworks that rely solely on greedy-decoded outputs, proposing instead to derive judgments from the full probability distribution over judgment tokens. Methodologically: (1) it shows that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e., greedy decoding) across evaluation settings; (2) it explores novel methods of deriving preferences from judgment distributions, finding that methods incorporating risk aversion often improve performance; and (3) it demonstrates empirically that chain-of-thought (CoT) prompting can collapse the spread of the judgment distribution, often harming performance. Experiments span pointwise, pairwise, and listwise evaluation paradigms. The findings suggest that leveraging the distributional output of LLM judges, rather than the text interface alone, enables more fine-grained and reliable automated text assessment.
📝 Abstract
Using language models to scalably approximate human preferences on text quality (LLM-as-a-judge) has become a standard practice applicable to many tasks. A judgment is often extracted from the judge's textual output alone, typically with greedy decoding. However, LLM judges naturally provide distributions over judgment tokens, inviting a breadth of inference methods for extracting fine-grained preferences. We find that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e. greedy decoding) in all evaluation settings (i.e. pointwise, pairwise, and listwise). We further explore novel methods of deriving preferences from judgment distributions, and find that methods incorporating risk aversion often improve performance. Lastly, we analyze LLM-as-a-judge paired with chain-of-thought (CoT) prompting, showing that CoT can collapse the spread of the judgment distribution, often harming performance. Our findings suggest leveraging distributional output can improve LLM-as-a-judge, as opposed to using the text interface alone.
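The mean-versus-mode distinction above can be made concrete with a small sketch. Assuming a pointwise judge that rates text on a 1–5 scale, and given (hypothetical, illustrative) log-probabilities over the candidate score tokens, the mode corresponds to greedy decoding while the mean is a probability-weighted expected score; a simple risk-averse variant (one possible instantiation, not necessarily the paper's exact method) penalizes high-variance judgments:

```python
import math

# Hypothetical log-probabilities an LLM judge assigns to the score tokens
# "1".."5" for a single pointwise rating (illustrative values only).
logprobs = {"1": -4.2, "2": -2.1, "3": -0.9, "4": -0.8, "5": -2.5}

# Normalize into a proper probability distribution over the score tokens.
unnorm = {tok: math.exp(lp) for tok, lp in logprobs.items()}
total = sum(unnorm.values())
probs = {tok: p / total for tok, p in unnorm.items()}

# Mode = what greedy decoding returns: the single most likely score token.
mode_score = int(max(probs, key=probs.get))

# Mean = probability-weighted expected score: a fine-grained judgment that
# uses the whole distribution, not just its peak.
mean_score = sum(int(tok) * p for tok, p in probs.items())

# Risk-averse variant (illustrative): subtract a fraction of the standard
# deviation, so uncertain judgments are discounted toward lower scores.
variance = sum(p * (int(tok) - mean_score) ** 2 for tok, p in probs.items())
risk_averse_score = mean_score - 0.5 * math.sqrt(variance)
```

With these example values the mode is 4 while the mean lands near 3.4, showing how greedy decoding can discard the information that substantial probability mass sits on the neighboring score.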