🤖 AI Summary
Prior work lacks systematic empirical comparisons of test-time scaling (TTS) strategies, and the influence of model architecture and task difficulty on strategy efficacy remains unclear.
Method: We conduct the first unified evaluation across eight open-source large language models and four reasoning benchmarks, assessing eight mainstream TTS strategies—including chain-of-thought, self-consistency, and search-based methods—spanning over thirty billion generated tokens under matched compute budgets.
Contribution/Results: We identify a dichotomy in reasoning trace quality—short-horizon versus long-horizon—and find that, for a given model type, optimal performance improves monotonically with compute budget, yet no single strategy dominates universally. Based on these findings, we propose a dynamic strategy selection framework guided by problem difficulty, model capability, and compute budget. Our study delivers reproducible empirical evidence and actionable guidelines for optimizing LLM inference efficiency.
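Of the strategies named above, self-consistency is the simplest to make concrete: sample several independent reasoning traces for the same problem and return the majority-vote final answer. A minimal generic sketch (the `sample_fn` callable standing in for an LLM sampler is hypothetical, not an interface from the paper):

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=8):
    """Generic self-consistency: draw n independent answers for the same
    prompt and return the most common one (majority vote)."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In practice each call to `sample_fn` would be a full chain-of-thought generation at nonzero temperature, with the final answer parsed from the trace; the token cost therefore grows linearly in `n`, which is exactly the compute-budget knob the study varies.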
📝 Abstract
Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we distill a practical recipe for selecting the best TTS strategy as a function of problem difficulty, model type, and compute budget, providing a concrete guide to effective inference-time scaling.
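The selection recipe can be pictured as a small dispatcher over the three inputs the abstract names. The thresholds and strategy choices below are purely illustrative assumptions for the shape of such a rule, not the paper's actual decision boundaries:

```python
def select_tts_strategy(difficulty, model_type, budget_tokens):
    """Illustrative TTS-strategy dispatcher keyed on problem difficulty,
    model type (short-horizon vs long-horizon), and compute budget.
    All thresholds and choices here are hypothetical placeholders."""
    if budget_tokens < 10_000:
        # Tiny budgets only afford a single reasoning trace.
        return "chain-of-thought"
    if model_type == "long-horizon":
        # Long-horizon models: spend extra budget on parallel samples
        # for hard problems, a single long trace otherwise.
        return "self-consistency" if difficulty == "hard" else "chain-of-thought"
    # Short-horizon models: assume search-based methods pay off on
    # hard problems, majority voting elsewhere.
    return "search" if difficulty == "hard" else "self-consistency"
```

The point is the structure, not the specific rules: because no strategy dominates universally (trend 1), the best choice is conditional, and the three arguments above are exactly the conditioning variables the study identifies.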