🤖 AI Summary
This work addresses the challenge of obtaining stable and reliable task-specific large language model (LLM) rankings from sparse and imbalanced pairwise comparison data. The authors propose a low-rank modeling approach that treats the task–model capability matrix as having an intrinsic low-rank structure, thereby sharing information across tasks while preserving task-specific characteristics. They combine convex initialization with alternating minimization to estimate latent scores, and their main contribution is an uncertainty quantification framework for task-specific LLM ranking. By employing a cross-fitted one-step debiased estimator, they derive asymptotically valid confidence intervals for fixed score contrasts, then extend the inference to the high-dimensional ranking regime to support simultaneous confidence sets for per-task ranks and top-K membership tests. Experiments on both synthetic data and Chatbot Arena show that the proposed method improves sample efficiency over independent task-wise Bradley–Terry estimation and yields tighter, better-calibrated ranking confidence sets, with the largest gains under sparsity.
📝 Abstract
Pairwise human-preference platforms such as Chatbot Arena have become central to large language model (LLM) evaluation, yet reliable task-specific ranking remains challenging. Global leaderboards mask task heterogeneity, while ranking each fine-grained task independently is unstable under sparse, imbalanced comparisons. We propose a low-rank framework for task-specific LLM ranking from sparse pairwise comparisons, modeling the task-by-model ability matrix $\Theta^\star \in \mathbb{R}^{d_t \times d_m}$ as low rank so that information is shared across related tasks while task-specific differences are preserved. We first develop a max-norm ($\ell_\infty$) accurate estimator for the latent scores, combining a convex initializer with alternating-minimization refinement, and prove task-wise top-$K$ recovery guarantees under sparse sampling. Our main contribution is an uncertainty quantification framework for task-specific ranking. We construct cross-fitted one-step debiased estimators for fixed score contrasts -- such as the task-specific ability gap between two models -- yielding asymptotically valid confidence intervals that attain the semiparametric efficiency bound. We then extend the inference to the high-dimensional ranking regime, where per-task ranks and top-$K$ membership are determined by many dependent score-gap hypotheses. Using Gaussian and multiplier-bootstrap calibration, we obtain simultaneous confidence sets for per-task ranks and valid top-$K$ membership tests across many tasks and models. Experiments on synthetic data and Chatbot Arena show that low-rank sharing improves sample efficiency over independent task-wise Bradley-Terry estimation and produces tighter, better-calibrated ranking certificates, with the largest gains in the sparse regime typical of real LLM benchmarks.
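To make the low-rank idea concrete, the following is a minimal sketch and not the authors' estimator. It assumes a Bradley-Terry link, i.e. that model $j$ beats model $k$ on task $t$ with probability $\sigma(\Theta_{tj} - \Theta_{tk})$, and it parameterizes $\Theta = UV^\top$ with a fixed rank. In place of the convex initializer and alternating-minimization refinement described in the abstract, it uses plain gradient descent on the factors from a random start. All function names, dimensions, and step sizes are hypothetical choices for illustration, and no debiasing or confidence-interval step is included.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_comparisons(theta, n_obs, rng):
    """Draw sparse comparisons (task, j, k, y); y = 1 means model j beat model k."""
    d_t, d_m = theta.shape
    tasks = rng.integers(0, d_t, size=n_obs)
    j = rng.integers(0, d_m, size=n_obs)
    k = (j + rng.integers(1, d_m, size=n_obs)) % d_m  # offset in 1..d_m-1, so k != j
    p_win = sigmoid(theta[tasks, j] - theta[tasks, k])  # assumed Bradley-Terry link
    y = (rng.random(n_obs) < p_win).astype(float)
    return tasks, j, k, y

def fit_low_rank_bt(tasks, j, k, y, d_t, d_m, rank, steps=500, lr=1.0, seed=0):
    """Gradient descent on the factors of Theta = U @ V.T (illustrative only)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((d_t, rank))
    V = 0.1 * rng.standard_normal((d_m, rank))
    n = len(y)
    for _ in range(steps):
        theta = U @ V.T
        # Derivative of the mean logistic loss with respect to each score gap.
        resid = (sigmoid(theta[tasks, j] - theta[tasks, k]) - y) / n
        grad = np.zeros((d_t, d_m))
        np.add.at(grad, (tasks, j), resid)    # the gap increases with Theta[t, j]
        np.add.at(grad, (tasks, k), -resid)   # and decreases with Theta[t, k]
        # Simultaneous update: the right-hand side uses the old U and V.
        U, V = U - lr * grad @ V, V - lr * grad.T @ U
    theta = U @ V.T
    # Scores are identified only up to a per-task shift, so center each row.
    return theta - theta.mean(axis=1, keepdims=True)

# Synthetic example: 6 tasks, 5 models, rank 2, 3000 comparisons in total.
rng = np.random.default_rng(42)
d_t, d_m, rank, K = 6, 5, 2, 2
theta_true = rng.standard_normal((d_t, rank)) @ rng.standard_normal((d_m, rank)).T
tasks, j, k, y = simulate_comparisons(theta_true, 3000, rng)
theta_hat = fit_low_rank_bt(tasks, j, k, y, d_t, d_m, rank)

# Per-task top-K: indices of the K highest estimated scores in each row.
top_k = np.argsort(-theta_hat, axis=1)[:, :K]
print(top_k)
```

Because every task row is built from the same factor matrix `V`, a task with few comparisons still borrows strength from the others, which is the information sharing the abstract refers to. Fitting each row separately would correspond to the independent task-wise Bradley-Terry baseline.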