The Art of Scaling Test-Time Compute for Large Language Models

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks systematic empirical comparisons of test-time scaling (TTS) strategies, and the influence of model architecture and task difficulty on strategy efficacy remains unclear. Method: We conduct the first unified evaluation across eight open-source large language models and four reasoning benchmarks, assessing eight mainstream TTS strategies (including chain-of-thought, self-consistency, and search-based methods) under matched conditions, generating over thirty billion tokens in total. Contribution/Results: We identify a dichotomy in reasoning trace quality, short-horizon versus long-horizon, and find that optimal performance improves monotonically with compute budget, yet no single strategy dominates universally. Based on these findings, we propose a dynamic strategy-selection framework guided by problem difficulty, model capability, and compute budget. Our study delivers reproducible empirical evidence and actionable guidelines for optimizing LLM inference efficiency.
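Of the strategies named above, self-consistency is the simplest to illustrate: sample several independent reasoning traces and take a majority vote over their final answers. A minimal sketch follows; the `noisy_solver` stand-in for an LLM call is illustrative, not from the paper:

```python
import random
from collections import Counter

def self_consistency(sample_answer, n_samples=8):
    """Majority vote over independently sampled reasoning traces.

    `sample_answer` is any zero-argument callable that runs one stochastic
    chain-of-thought generation and returns its final answer string.
    Returns the winning answer and the fraction of samples that agreed.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples

# Toy stand-in for an LLM call: a solver that is right ~70% of the time.
def noisy_solver():
    return "42" if random.random() < 0.7 else "41"

random.seed(0)
answer, agreement = self_consistency(noisy_solver, n_samples=16)
```

Voting makes the aggregate more reliable than any single trace, which is why the number of samples (i.e. the compute budget) matters for this family of strategies.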

📝 Abstract
Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a recipe for selecting the best TTS strategy given problem difficulty, model type, and compute budget, offering a practical guide to effective inference-time scaling.
Problem

Research questions and friction points this paper is trying to address.

Systematically compare test-time scaling strategies under identical conditions
Clarify influence of model type and problem difficulty on performance
Provide practical guide for selecting optimal test-time scaling strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic compute allocation during inference
Large-scale comparison of test-time scaling strategies
Optimal strategy selection based on model and difficulty
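The selection idea in the last point can be sketched as a small decision rule. The thresholds and strategy labels below are illustrative assumptions for the sketch, not values or rules reported in the paper:

```python
def select_tts_strategy(difficulty: float, long_horizon_model: bool,
                        budget_tokens: int) -> str:
    """Pick a test-time scaling strategy from problem difficulty
    (0 = easy, 1 = hard), model type, and token budget.

    All cutoffs here are hypothetical placeholders.
    """
    if budget_tokens < 2_000:
        return "chain-of-thought"        # cheapest: a single reasoning trace
    if difficulty < 0.3:
        return "self-consistency"        # easy: majority vote over samples
    if long_horizon_model:
        return "single long trace"       # let the model reason at length
    return "search over traces"          # short-horizon model: parallel search
```

A real implementation would calibrate the cutoffs per model and benchmark; the point is only that difficulty, model type, and budget jointly determine the choice.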
Aradhye Agarwal
Microsoft Research
Ayan Sengupta
Indian Institute of Technology Delhi
Natural Language Processing · Meta Learning · Reinforcement Learning
Tanmoy Chakraborty
Indian Institute of Technology Delhi