Demystifying LLM-as-a-Judge: Analytically Tractable Model for Inference-Time Scaling

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how inference-time compute scaling, such as best-of-$k$ sampling and temperature tuning, affects generalization performance in the LLM-as-a-judge setting. We propose the first analytically tractable theory of inference-time scaling, modeling discriminative inference as reward-weighted Bayesian linear regression and deriving closed-form expressions for the prediction mean, variance, and generalization error in the high-dimensional limit. Our theoretical analysis establishes: (1) an optimal trade-off between sampling count $k$ and temperature; (2) a $\Theta(1/k^2)$ decay rate of the generalization error; (3) a rigorously characterized regime where inference scaling strictly dominates data scaling; and (4) a finite optimal $k$ under reward mismatch, with diminishing returns as task difficulty increases. Empirical results validate the predicted error decay law and temperature optimality.
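As a toy illustration of the predicted $\Theta(1/k^2)$ law, the following sketch keeps the best of $k$ draws under a teacher reward $-(y - y^*)^2$ and estimates the resulting squared error. The 1-D Gaussian proposal and the target $y^*$ are illustrative assumptions, not the paper's actual high-dimensional setup:

```python
import numpy as np

rng = np.random.default_rng(42)

def best_of_k_error(k, trials=20000):
    """Monte-Carlo estimate of the squared error after keeping the best
    of k Gaussian draws under the teacher reward r(y) = -(y - y*)^2."""
    y_star = 0.0
    draws = rng.normal(y_star, 1.0, size=(trials, k))
    idx = np.argmin((draws - y_star) ** 2, axis=1)   # best-of-k selection
    best = draws[np.arange(trials), idx]
    return float(np.mean((best - y_star) ** 2))

errs = {k: best_of_k_error(k) for k in (1, 4, 16, 64)}
```

In this toy setting the estimated error shrinks roughly quadratically in $k$, consistent with the $\Theta(1/k^2)$ rate derived in the paper.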

📝 Abstract
Recent developments in large language models have shown the advantages of reallocating a notable share of computational resources from training time to inference time. However, the principles behind inference-time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined by a linear model, modeling the LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw $k$ inference-time samples and select among them via a softmax, at a given temperature, applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with the number of inference-time samples $k$. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$ beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We verify these facts experimentally in large language model inference with an additional large language model as a judge. In the "best-of-$k$" limit with the teacher as reward, we show theoretically that the generalization error decays as $Θ(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that as task difficulty increases, the aforementioned advantage of inference-time compute degrades.
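The selection step described in the abstract (draw $k$ samples, then pick one via a softmax over a quadratic reward at some temperature) can be sketched as follows. The Gaussian proposals, the target $y^*$, and the reward are illustrative assumptions; the temperature $\to 0$ limit recovers best-of-$k$:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_by_reward(samples, reward, temperature):
    """Pick one of k candidates via a softmax over their rewards.
    As temperature -> 0 this reduces to best-of-k (argmax of the reward)."""
    r = np.array([reward(s) for s in samples])
    logits = r / temperature
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return samples[rng.choice(len(samples), p=probs)]

# Illustrative quadratic reward around a hypothetical target y*.
y_star = 1.0
reward = lambda y: -(y - y_star) ** 2

k = 8
candidates = rng.normal(0.0, 1.0, size=k)   # k inference-time draws
choice = select_by_reward(candidates, reward, temperature=0.1)
```

The temperature controls how sharply the sampler concentrates on the highest-reward candidate, which is why, for fixed $k$, an optimal temperature exists in the paper's analysis.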
Problem

Research questions and friction points this paper is trying to address.

Develops an analytically tractable model to study inference-time scaling in large language models.
Analyzes generalization error when selecting outputs via reward-weighted sampling during inference.
Identifies conditions where increasing inference-time samples improves or degrades model performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian linear regression with reward-weighted sampler
High-dimensional regime with deterministic equivalents
Optimal sampling temperature for fixed k
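A minimal sketch of the first ingredient, Bayesian linear regression with a conjugate Gaussian prior, assuming a toy teacher $y = w_\star^\top x + \varepsilon$. The prior scale, noise variance, and dimensions are illustrative; the paper's reward-weighting and high-dimensional deterministic-equivalent analysis are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_predictive(X, y, x_new, alpha=1.0, sigma2=0.25):
    """Posterior predictive mean and variance for Bayesian linear regression
    with prior w ~ N(0, alpha^{-1} I) and observation noise variance sigma2."""
    d = X.shape[1]
    precision = alpha * np.eye(d) + X.T @ X / sigma2   # posterior precision
    cov = np.linalg.inv(precision)
    w_mean = cov @ (X.T @ y) / sigma2                  # posterior mean of w
    mean = float(x_new @ w_mean)
    var = float(sigma2 + x_new @ cov @ x_new)          # predictive variance
    return mean, var

# Toy teacher model generating the training data.
d, n = 20, 200
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(scale=0.5, size=n)
x_new = rng.normal(size=d)
mu, var = posterior_predictive(X, y, x_new)
```

As more teacher-generated data are observed, the posterior covariance contracts and the predictive variance approaches the irreducible noise floor, the quantity the paper's data-scaling baseline tracks.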
Indranil Halder
John A. Paulson School of Engineering and Applied Sciences, Harvard University
Cengiz Pehlevan
Harvard University
Neural Networks · Theoretical Neuroscience · Machine Learning · Physics of Learning