Ranking Reasoning LLMs under Test-Time Scaling

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic investigation into reliably ranking reasoning-based large language models under test-time scaling. It formally defines the dense benchmark ranking problem in this setting and introduces Scorio, a library integrating diverse statistical ranking techniques, including paired-comparison models, item response theory (IRT), voting rules, and graph-based methods. Experiments with 20 reasoning models on four olympiad-level mathematical benchmarks show that most full-trial rankings agree closely with the Bayesian gold standard (mean Kendall's τ_b of 0.93–0.95), while in the single-trial regime the best methods reach τ_b ≈ 0.86. Incorporating greedy decoding as an empirical prior reduces ranking variance at N=1 by 16–52%, though it can bias rankings when greedy and stochastic sampling disagree. This work provides a robust, reproducible ranking framework for evaluating the reasoning capabilities of large language models.

📝 Abstract
Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $τ_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
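The agreement numbers quoted above are Kendall's τ_b, a rank correlation that counts concordant versus discordant pairs while correcting for ties. As a minimal illustration (this is not Scorio's API, just a pure-Python sketch of the statistic), τ_b between a hypothetical gold ranking and a method's ranking can be computed as:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score/rank sequences."""
    C = D = Tx = Ty = 0  # concordant, discordant, ties in x only, ties in y only
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue        # tied in both sequences: excluded from all terms
        elif dx == 0:
            Tx += 1
        elif dy == 0:
            Ty += 1
        elif dx * dy > 0:
            C += 1          # pair ordered the same way in both rankings
        else:
            D += 1          # pair ordered oppositely
    return (C - D) / sqrt((C + D + Tx) * (C + D + Ty))

# Hypothetical example: a method ranking that swaps two adjacent models
gold = [1, 2, 3, 4, 5]
method = [1, 2, 4, 3, 5]
print(kendall_tau_b(gold, method))  # 0.8 (one discordant pair out of ten)
```

A value of 1.0 means the method reproduces the gold ordering exactly, which is the sense in which the abstract reports 19–34 methods recovering the same ordering.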
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
ranking
reasoning LLMs
benchmark evaluation
statistical ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
statistical ranking
reasoning LLMs
item response theory
Scorio
Mohsen Hariri
Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
Michael Hinczewski
Department of Physics, Case Western Reserve University, Cleveland, OH, USA
Jing Ma
Case Western Reserve University
Trustworthy ML, causal inference, graph mining, machine learning, data mining
Vipin Chaudhary
Case Western Reserve University
High Performance Computing, Artificial Intelligence, Data Science, Computer Vision, Quantum Computing