🤖 AI Summary
Prior work lacks systematic empirical comparisons of test-time scaling (TTS) strategies, and the influence of model architecture and task difficulty on strategy efficacy remains unclear.
Method: We conduct the first unified evaluation across eight open-source large language models and four reasoning benchmarks, assessing eight mainstream TTS strategies—including chain-of-thought, self-consistency, and search-based methods—spanning over thirty billion generated tokens under matched compute budgets.
Contribution/Results: We identify a dichotomy in reasoning trace quality—short-horizon versus long-horizon—and find that, for a given model type, optimal performance improves monotonically with compute budget, yet no single strategy dominates universally. Based on these findings, we propose a dynamic strategy selection framework guided by problem difficulty, model capability, and compute budget. Our study delivers reproducible empirical evidence and actionable guidelines for optimizing LLM inference efficiency.
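Of the strategies named above, self-consistency is the simplest to make concrete: sample several independent reasoning traces for the same problem and return the majority-vote final answer. A minimal generic sketch (the `sample_fn` callable standing in for an LLM sampler is hypothetical, not an interface from the paper):

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=8):
    """Generic self-consistency: draw n independent answers for the same
    prompt and return the most common one (majority vote)."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In practice each call to `sample_fn` would be a full chain-of-thought generation at nonzero temperature, with the final answer parsed from the trace; the token cost therefore grows linearly in `n`, which is exactly the compute-budget knob the study varies.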
📝 Abstract
Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we distill a practical recipe for selecting the best TTS strategy as a function of problem difficulty, model type, and compute budget, providing a concrete guide to effective inference-time scaling.
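The selection recipe can be pictured as a small dispatcher over the three inputs the abstract names. The thresholds and strategy choices below are purely illustrative assumptions for the shape of such a rule, not the paper's actual decision boundaries:

```python
def select_tts_strategy(difficulty, model_type, budget_tokens):
    """Illustrative TTS-strategy dispatcher keyed on problem difficulty,
    model type (short-horizon vs long-horizon), and compute budget.
    All thresholds and choices here are hypothetical placeholders."""
    if budget_tokens < 10_000:
        # Tiny budgets only afford a single reasoning trace.
        return "chain-of-thought"
    if model_type == "long-horizon":
        # Long-horizon models: spend extra budget on parallel samples
        # for hard problems, a single long trace otherwise.
        return "self-consistency" if difficulty == "hard" else "chain-of-thought"
    # Short-horizon models: assume search-based methods pay off on
    # hard problems, majority voting elsewhere.
    return "search" if difficulty == "hard" else "self-consistency"
```

The point is the structure, not the specific rules: because no strategy dominates universally (trend 1), the best choice is conditional, and the three arguments above are exactly the conditioning variables the study identifies.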