Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pairwise preference evaluation with chain-of-thought (CoT) large language models (LLMs) suffers from high single-sample noise, and conventional aggregation rules (such as majority voting or soft self-consistency) yield inconsistent outcomes when ties are permitted. Method: the paper proposes a distribution-calibrated, inference-time aggregation scheme grounded in the Bradley–Terry–Davidson (BTD) model. It explicitly disentangles preference polarity from decisiveness to separate narrow margins from strong consensus, combining *n* independent CoT rating samples per item with inference-time compute allocation and three-way (win/loss/tie) preference modeling. Contribution/Results: evaluated across multiple benchmarks, the method significantly reduces mean absolute error (MAE) and improves pairwise accuracy. Compared against human-consensus labels, its performance matches or exceeds that of individual human annotators, demonstrating both robustness and human-level reliability in preference judgment.

📝 Abstract
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
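The Bradley–Terry–Davidson formulation on rating counts can be illustrated with a minimal sketch. For a single item pair, the saturated multinomial MLE has a closed form: the strength ratio equals the win ratio and Davidson's tie parameter is t/√(wl). The function names and the add-one smoothing below are illustrative assumptions, not the paper's exact pipeline:

```python
import math

def davidson_mle(wins_a: int, wins_b: int, ties: int):
    """Closed-form MLE of the Bradley-Terry-Davidson model for one pair.

    Given counts (w, l, t) from n independent judge samples, the saturated
    multinomial MLE gives strength ratio pi_A / pi_B = w / l and tie
    parameter nu = t / sqrt(w * l). A sketch of the modeling idea only.
    """
    if wins_a == 0 or wins_b == 0:
        # Add-one smoothing keeps the ratio finite (an assumption here,
        # not something the summary specifies).
        wins_a, wins_b, ties = wins_a + 1, wins_b + 1, ties + 1
    ratio = wins_a / wins_b                   # preference polarity (strength ratio)
    nu = ties / math.sqrt(wins_a * wins_b)    # tie propensity (low nu = decisive)
    return ratio, nu

def btd_probs(ratio: float, nu: float):
    """Three-way outcome probabilities under the BTD model, fixing pi_B = 1."""
    pi_a, pi_b = ratio, 1.0
    denom = pi_a + pi_b + nu * math.sqrt(pi_a * pi_b)
    return (pi_a / denom,                         # P(A wins)
            pi_b / denom,                         # P(B wins)
            nu * math.sqrt(pi_a * pi_b) / denom)  # P(tie)
```

For example, 10 judge samples split 6/3/1 give a strength ratio of 2.0, and plugging the fitted parameters back into `btd_probs` recovers the empirical frequencies (0.6, 0.3, 0.1), as the saturated MLE should.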
Problem

Research questions and friction points this paper is trying to address.

Reduces noise in LLM pairwise preference judgments
Addresses inconsistency in aggregation rules with ties
Improves evaluation accuracy via distribution-calibrated inference-time compute
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-calibrated aggregation for preference modeling
Leveraging polarity and decisiveness in rating counts
Allocating inference-time compute to improve evaluation reliability
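The polarity/decisiveness distinction can be made concrete with a small example. A plain majority vote treats an 6-5 split and a 6-0-with-5-ties split identically (both pick A), while the two statistics below tell them apart. The function name and definitions are an illustrative reading of the summary, not the paper's exact estimators:

```python
def polarity_decisiveness(wins_a: int, wins_b: int, ties: int):
    """Split n three-way judge votes into two signals.

    polarity:     margin among non-tie votes, in [-1, 1]
    decisiveness: fraction of votes that are non-ties, in [0, 1]
    """
    n = wins_a + wins_b + ties
    non_ties = wins_a + wins_b
    polarity = (wins_a - wins_b) / non_ties if non_ties else 0.0
    decisiveness = non_ties / n if n else 0.0
    return polarity, decisiveness

# A narrow 6-5 margin: fully decisive voters, but near-zero polarity.
narrow = polarity_decisiveness(6, 5, 0)
# A 6-0-5 outcome: maximal polarity, but many abstentions (ties).
consensus = polarity_decisiveness(6, 0, 5)
```

Majority vote declares A the winner in both cases; the pair (polarity, decisiveness) separates a weak advantage (≈0.09, 1.0) from a strong but tie-heavy consensus (1.0, ≈0.55), which is the distinction the aggregation scheme exploits.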