Low Rank for Rank: Uncertainty-Aware Task-Specific LLM Ranking under Sparse Pairwise Comparisons

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of obtaining stable and reliable task-specific large language model (LLM) rankings from sparse and imbalanced pairwise comparison data. The authors model the task–model capability matrix as low rank, which shares information across tasks while preserving task-specific characteristics. They combine convex initialization with alternating minimization to estimate latent scores and establish an uncertainty quantification framework for task-specific LLM ranking. Using a cross-fitted one-step debiased estimator, they derive asymptotically valid confidence intervals and extend the inference to high-dimensional settings to support simultaneous confidence sets and top-$K$ inference. Experiments on both synthetic data and Chatbot Arena show that the proposed method improves on independent task-wise Bradley–Terry models, achieving higher sample efficiency and tighter, better-calibrated ranking confidence intervals, especially under sparsity.

📝 Abstract

Pairwise human-preference platforms such as Chatbot Arena have become central to large language model (LLM) evaluation, yet reliable task-specific ranking remains challenging. Global leaderboards mask task heterogeneity, while ranking each fine-grained task independently is unstable under sparse, imbalanced comparisons. We propose a low-rank framework for task-specific LLM ranking from sparse pairwise comparisons, modeling the task-by-model ability matrix $\Theta^\star \in \mathbb{R}^{d_t \times d_m}$ as low rank so that information is shared across related tasks while task-specific differences are preserved. We first develop a max-norm ($\ell_\infty$) accurate estimator for the latent scores, combining a convex initializer with alternating-minimization refinement, and prove task-wise top-$K$ recovery guarantees under sparse sampling. Our main contribution is an uncertainty quantification framework for task-specific ranking. We construct cross-fitted one-step debiased estimators for fixed score contrasts -- such as the task-specific ability gap between two models -- yielding asymptotically valid confidence intervals that attain the semiparametric efficiency bound. We then extend the inference to the high-dimensional ranking regime, where per-task ranks and top-$K$ membership are determined by many dependent score-gap hypotheses. Using Gaussian and multiplier-bootstrap calibration, we obtain simultaneous confidence sets for per-task ranks and valid top-$K$ membership tests across many tasks and models. Experiments on synthetic data and Chatbot Arena show that low-rank sharing improves sample efficiency over independent task-wise Bradley-Terry estimation and produces tighter, better-calibrated ranking certificates, with the largest gains in the sparse regime typical of real LLM benchmarks.
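
To make the setup in the abstract concrete, here is a minimal sketch of a low-rank task-by-model score matrix with Bradley-Terry comparisons drawn from it. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian factor model, the uniform sampling of tasks and model pairs, and the logistic link are all assumptions chosen for the example. The paper's estimator (convex initialization, alternating minimization, debiasing) is not reproduced here.

```python
import numpy as np


def simulate_low_rank_scores(d_t, d_m, rank, rng):
    """Draw a d_t x d_m score matrix of rank at most `rank` (hypothetical generator)."""
    U = rng.normal(size=(d_t, rank))   # task factors
    V = rng.normal(size=(d_m, rank))   # model factors
    theta = U @ V.T / np.sqrt(rank)
    # Bradley-Terry scores are identified only up to a per-task shift,
    # so center each task's row. Centering keeps the rank at most `rank`.
    return theta - theta.mean(axis=1, keepdims=True)


def win_probability(theta, task, j, k):
    """Bradley-Terry probability that model j beats model k on the given task."""
    return 1.0 / (1.0 + np.exp(-(theta[task, j] - theta[task, k])))


def sample_comparisons(theta, n, rng):
    """Sample n pairwise outcomes with tasks and model pairs chosen uniformly."""
    d_t, d_m = theta.shape
    tasks = rng.integers(0, d_t, size=n)
    j = rng.integers(0, d_m, size=n)
    # Offset in 1..d_m-1 guarantees the opponent k differs from j.
    k = (j + rng.integers(1, d_m, size=n)) % d_m
    p = 1.0 / (1.0 + np.exp(-(theta[tasks, j] - theta[tasks, k])))
    wins = (rng.random(n) < p).astype(int)  # 1 if j beat k
    return tasks, j, k, wins


def top_k(theta, task, K):
    """Indices of the K highest-scoring models for one task."""
    return np.argsort(-theta[task])[:K]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = simulate_low_rank_scores(d_t=6, d_m=5, rank=2, rng=rng)
    tasks, j, k, wins = sample_comparisons(theta, n=200, rng=rng)
    print("comparisons per task:", np.bincount(tasks, minlength=6))
    print("top-3 models on task 0:", top_k(theta, 0, 3))
```

With only a few hundred comparisons spread over 6 tasks and 10 model pairs per task, many task-pair cells receive very few observations. This is the sparse regime in which fitting each task independently becomes unstable and sharing information through the low-rank factors is expected to help.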

Problem

Research questions and friction points this paper is trying to address.

LLM ranking

sparse pairwise comparisons

task-specific evaluation

uncertainty quantification

low-rank modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

low-rank modeling

uncertainty quantification

pairwise comparisons