Ranking Reasoning LLMs under Test-Time Scaling

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic investigation into reliably ranking reasoning-based large language models under test-time scaling. It formally defines the dense benchmark ranking problem in this setting and introduces Scorio, a library integrating diverse statistical ranking techniques, including paired-comparison models, item response theory (IRT), voting rules, and graph-based methods. Experiments with 20 reasoning models on four olympiad-level mathematical benchmarks show that most full-trial rankings agree closely with the Bayesian gold standard (mean Kendall's τ_b of 0.93–0.95), while in the single-trial regime the best methods reach τ_b ≈ 0.86. Incorporating greedy decoding as an empirical prior reduces ranking variance at N=1 by 16–52%, though it can bias rankings when greedy and stochastic sampling disagree. This work provides a robust, reproducible ranking framework for evaluating the reasoning capabilities of large language models.

📝 Abstract
Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $τ_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
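The agreement numbers quoted above are Kendall's τ_b, a rank correlation that counts concordant versus discordant pairs while correcting for ties. As a minimal illustration (this is not Scorio's API, just a pure-Python sketch of the statistic), τ_b between a hypothetical gold ranking and a method's ranking can be computed as:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score/rank sequences."""
    C = D = Tx = Ty = 0  # concordant, discordant, ties in x only, ties in y only
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue        # tied in both sequences: excluded from all terms
        elif dx == 0:
            Tx += 1
        elif dy == 0:
            Ty += 1
        elif dx * dy > 0:
            C += 1          # pair ordered the same way in both rankings
        else:
            D += 1          # pair ordered oppositely
    return (C - D) / sqrt((C + D + Tx) * (C + D + Ty))

# Hypothetical example: a method ranking that swaps two adjacent models
gold = [1, 2, 3, 4, 5]
method = [1, 2, 4, 3, 5]
print(kendall_tau_b(gold, method))  # 0.8 (one discordant pair out of ten)
```

A value of 1.0 means the method reproduces the gold ordering exactly, which is the sense in which the abstract reports 19–34 methods recovering the same ordering.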
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
ranking
reasoning LLMs
benchmark evaluation
statistical ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
statistical ranking
reasoning LLMs
item response theory
Scorio
Mohsen Hariri
Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
Michael Hinczewski
Department of Physics, Case Western Reserve University, Cleveland, OH, USA
Jing Ma
Case Western Reserve University
Trustworthy ML, causal inference, graph mining, machine learning, data mining
Vipin Chaudhary
Case Western Reserve University
High Performance Computing, Artificial Intelligence, Data Science, Computer Vision, Quantum Computing