🤖 AI Summary
This work addresses the challenge of noisy, sparse, and unevenly sampled pairwise human judgments that undermine reliable uncertainty quantification in large language model (LLM) evaluation benchmarks. Framing LLM assessment as a low-rank tensor completion problem with structured observations and non-uniform sampling under a Bradley–Terry–Luce-type model, the authors propose a unified semiparametric inference framework. Key innovations include a score-whitening technique that overcomes the inferential bottleneck caused by anisotropy in the information operator, and the construction of an efficient debiased estimator that achieves asymptotic normality. The resulting method enables stable inference at optimal sample complexity and supports efficient, reliable uncertainty quantification for both linear and nonlinear functionals, such as ability gaps and win probabilities, thereby offering a principled foundation for robust LLM benchmarking.
📝 Abstract
Large language model (LLM) evaluation platforms increasingly rely on pairwise human judgments. These data are noisy, sparse, and non-uniform, yet leaderboards are reported with limited uncertainty quantification. We study this as semiparametric inference for a low-rank latent score tensor observed through pairwise comparisons under Bradley–Terry–Luce-type models. This places LLM evaluation in a new tensor completion setting with structured observations, non-uniform sampling, and pairwise contrasts. Our target is a smooth functional $\psi(T^\star)$, including linear estimands such as ability gaps and nonlinear ones such as win probabilities. We derive the information operator on the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound, then construct a one-step debiased estimator with asymptotic normality. A central challenge is that the information operator is anisotropic and does not commute with the tangent-space projection, creating a bottleneck absent from isotropic models. We introduce a score-whitening method that equalizes local Fisher information and restores stable inference at the optimal sample-complexity scale. Our results provide a principled framework for uncertainty quantification in LLM evaluation and more broadly for inference on low-rank structures from pairwise data.
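As a minimal illustration of the model class named in the abstract (not the paper's estimator): under a Bradley–Terry–Luce model, the probability that model $i$ beats model $j$ is the logistic function of the gap between their latent ability scores, which is exactly the kind of nonlinear functional (win probability) built from a linear one (ability gap) discussed above. The variable names below are hypothetical.

```python
import math

def btl_win_prob(theta_i: float, theta_j: float) -> float:
    """P(model i beats model j) under a Bradley-Terry-Luce model:
    the logistic function applied to the ability gap theta_i - theta_j."""
    gap = theta_i - theta_j
    return 1.0 / (1.0 + math.exp(-gap))

# Equal abilities yield a 50% win probability,
# and the two orderings' probabilities sum to one.
print(btl_win_prob(1.2, 1.2))                          # 0.5
print(btl_win_prob(2.0, 0.5) + btl_win_prob(0.5, 2.0)) # 1.0
```

Because the win probability is a smooth nonlinear map of the latent scores, its uncertainty cannot be read off directly from score-level confidence intervals; this is why the abstract targets general smooth functionals $\psi(T^\star)$ rather than the scores alone.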