SCATR: Simple Calibrated Test-Time Ranking

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Lightweight confidence heuristics are cheap but rank candidate responses poorly at test time, while strong learned scorers such as process reward models (PRMs) are expensive to train and run. This work proposes SCATR, an efficient test-time ranking method that uses a small calibration set to learn a lightweight scorer from the hidden representations of the base large language model for Best-of-N response selection. With few additional parameters and little computational overhead, it improves over prior confidence-based baselines by up to 9% on mathematical reasoning and coding benchmarks. Relative to LoRA fine-tuning on the same calibration data, it reaches comparable accuracy with up to 8,000× fewer trainable parameters and reduces training and inference latency by up to 150× and 1,000×, respectively. It is also competitive with strong PRM baselines, in several settings improving accuracy by up to 7.8% on math and 4.2% on coding, with up to 1,000× faster inference.
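For orientation, the confidence-based baseline that the summary compares against can be sketched in a few lines. The abstract describes these heuristics as based on token log-probabilities; the specific choice of the mean token log-probability below, and the helper names `mean_logprob` and `best_of_n`, are illustrative assumptions rather than the paper's exact baselines.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised confidence: the average token log-probability.

    Assumption: one common heuristic among several; the paper may use others.
    """
    if not token_logprobs:
        return float("-inf")  # an empty response gets the lowest confidence
    return sum(token_logprobs) / len(token_logprobs)


def best_of_n(candidates: List[str], scores: Sequence[float]) -> str:
    """Best-of-N selection: return the candidate with the highest score."""
    if len(candidates) != len(scores):
        raise ValueError("need exactly one score per candidate")
    if not candidates:
        raise ValueError("need at least one candidate")
    # On ties, max() keeps the first index that reaches the maximum.
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


# Hypothetical per-token log-probabilities for three sampled responses.
responses = ["answer A", "answer B", "answer C"]
logprobs = [[-0.1, -0.2], [-0.05, -0.05], [-1.0]]
chosen = best_of_n(responses, [mean_logprob(lp) for lp in logprobs])
```

Any scoring function can be swapped in for `mean_logprob` without touching `best_of_n`, which is why the quality of the scorer determines the quality of the selection.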

📝 Abstract

Test-time scaling (TTS) improves large language models (LLMs) by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%. Relative to LoRA fine-tuning on the same calibration data, it achieves comparable accuracy with up to 8000x fewer trainable parameters and much lower compute, reducing training and inference latency by up to 150x and 1000x, respectively. SCATR is also competitive with strong PRM baselines, and in several settings improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000x faster inference. Overall, SCATR offers a strong accuracy-efficiency trade-off for scalable test-time selection.
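A minimal sketch of the idea described in the abstract, under loudly stated assumptions: the abstract does not specify the scorer's architecture, the pooling of hidden states, or the training objective. The code below assumes one pooled hidden-state vector per response, a linear scorer (d weights plus a bias), and logistic regression on correct/incorrect labels from the calibration set. The names `fit_linear_scorer` and `select_best` are hypothetical, and the data is synthetic.

```python
import numpy as np


def fit_linear_scorer(hidden, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression scorer on calibration features.

    hidden: (n, d) array, one pooled hidden-state vector per response
            (assumption: the pooling itself is done elsewhere).
    labels: (n,) array of 0/1 correctness labels.
    Returns the weight vector w of shape (d,) and the scalar bias b.
    """
    hidden = np.asarray(hidden, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = hidden @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (hidden.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def select_best(candidate_hidden, w, b):
    """Best-of-N: index of the candidate with the highest scorer logit."""
    scores = np.asarray(candidate_hidden, dtype=np.float64) @ w + b
    return int(np.argmax(scores))


# Synthetic calibration set: correct responses are shifted along +direction,
# incorrect ones along -direction. Real features would come from the base LLM.
rng = np.random.default_rng(0)
dim, n_cal = 8, 64
direction = np.ones(dim)
cal_labels = rng.integers(0, 2, size=n_cal)
cal_hidden = rng.normal(scale=0.5, size=(n_cal, dim))
cal_hidden += np.where(cal_labels[:, None] == 1, direction, -direction)

w, b = fit_linear_scorer(cal_hidden, cal_labels)

# Three candidates for one prompt; only the middle one resembles a correct response.
candidates = np.stack([-direction, direction, -direction])
best = select_best(candidates, w, b)
```

Under these assumptions the scorer adds only `dim + 1` trainable parameters and a single matrix-vector product per batch of candidates at inference, which is the kind of overhead the abstract contrasts with LoRA fine-tuning and PRMs.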

Problem

Research questions and friction points this paper is trying to address.

test-time scaling

ranking

large language models

efficiency

scoring function

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling

lightweight scorer

calibration