Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pairwise preference evaluation with chain-of-thought (CoT) large language models (LLMs) suffers from high single-sample noise, and conventional aggregation rules (such as majority voting or soft self-consistency) yield inconsistent outcomes when ties are permitted. Method: the paper proposes a distribution-calibrated, inference-time aggregation scheme grounded in the Bradley–Terry–Davidson (BTD) model. It explicitly disentangles preference polarity from decisiveness to separate narrow margins from strong consensus, combining *n* independent CoT rating samples per item with inference-time compute allocation and three-way (win/loss/tie) preference modeling. Contribution/Results: evaluated across multiple benchmarks, the method significantly reduces mean absolute error (MAE) and improves pairwise accuracy. Compared against human-consensus labels, its performance matches or exceeds that of individual human annotators, demonstrating both robustness and human-level reliability in preference judgment.

📝 Abstract
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
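The Bradley–Terry–Davidson formulation on rating counts can be illustrated with a minimal sketch. For a single item pair, the saturated multinomial MLE has a closed form: the strength ratio equals the win ratio and Davidson's tie parameter is t/√(wl). The function names and the add-one smoothing below are illustrative assumptions, not the paper's exact pipeline:

```python
import math

def davidson_mle(wins_a: int, wins_b: int, ties: int):
    """Closed-form MLE of the Bradley-Terry-Davidson model for one pair.

    Given counts (w, l, t) from n independent judge samples, the saturated
    multinomial MLE gives strength ratio pi_A / pi_B = w / l and tie
    parameter nu = t / sqrt(w * l). A sketch of the modeling idea only.
    """
    if wins_a == 0 or wins_b == 0:
        # Add-one smoothing keeps the ratio finite (an assumption here,
        # not something the summary specifies).
        wins_a, wins_b, ties = wins_a + 1, wins_b + 1, ties + 1
    ratio = wins_a / wins_b                   # preference polarity (strength ratio)
    nu = ties / math.sqrt(wins_a * wins_b)    # tie propensity (low nu = decisive)
    return ratio, nu

def btd_probs(ratio: float, nu: float):
    """Three-way outcome probabilities under the BTD model, fixing pi_B = 1."""
    pi_a, pi_b = ratio, 1.0
    denom = pi_a + pi_b + nu * math.sqrt(pi_a * pi_b)
    return (pi_a / denom,                         # P(A wins)
            pi_b / denom,                         # P(B wins)
            nu * math.sqrt(pi_a * pi_b) / denom)  # P(tie)
```

For example, 10 judge samples split 6/3/1 give a strength ratio of 2.0, and plugging the fitted parameters back into `btd_probs` recovers the empirical frequencies (0.6, 0.3, 0.1), as the saturated MLE should.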
Problem

Research questions and friction points this paper is trying to address.

Reduces noise in LLM pairwise preference judgments
Addresses inconsistency in aggregation rules with ties
Improves evaluation accuracy via distribution-calibrated inference-time compute
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-calibrated aggregation for preference modeling
Leveraging polarity and decisiveness in rating counts
Allocating inference-time compute to improve evaluation reliability
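The polarity/decisiveness distinction can be made concrete with a small example. A plain majority vote treats an 6-5 split and a 6-0-with-5-ties split identically (both pick A), while the two statistics below tell them apart. The function name and definitions are an illustrative reading of the summary, not the paper's exact estimators:

```python
def polarity_decisiveness(wins_a: int, wins_b: int, ties: int):
    """Split n three-way judge votes into two signals.

    polarity:     margin among non-tie votes, in [-1, 1]
    decisiveness: fraction of votes that are non-ties, in [0, 1]
    """
    n = wins_a + wins_b + ties
    non_ties = wins_a + wins_b
    polarity = (wins_a - wins_b) / non_ties if non_ties else 0.0
    decisiveness = non_ties / n if n else 0.0
    return polarity, decisiveness

# A narrow 6-5 margin: fully decisive voters, but near-zero polarity.
narrow = polarity_decisiveness(6, 5, 0)
# A 6-0-5 outcome: maximal polarity, but many abstentions (ties).
consensus = polarity_decisiveness(6, 0, 5)
```

Majority vote declares A the winner in both cases; the pair (polarity, decisiveness) separates a weak advantage (≈0.09, 1.0) from a strong but tie-heavy consensus (1.0, ≈0.55), which is the distinction the aggregation scheme exploits.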