Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

📅 2025-02-10
🤖 AI Summary
This paper systematically investigates how Test-Time Scaling (TTS) affects the reasoning performance of large language models (LLMs). To address the limited generalizability of existing TTS strategies across model scales, process reward model (PRM) designs, and task difficulty levels, we propose three key techniques: process supervision, adaptive sampling, and dynamic computational budget allocation. We conduct extensive multi-model ablation studies on the MATH-500 and AIME24 benchmarks. Our work is the first to reveal that the optimal TTS strategy depends strongly on model scale, PRM design, and problem difficulty. Empirically, smaller models with an adapted TTS strategy outperform much larger counterparts: a 1B-parameter model surpasses a 405B model on MATH-500; a 0.5B model exceeds GPT-4o; and a 7B model outperforms both o1 and DeepSeek-R1, all with higher inference efficiency. These findings challenge the conventional “compute-as-capability” paradigm in LLM reasoning.

📝 Abstract
Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, all with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
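To make the idea of spending extra inference compute concrete, here is a minimal sketch of Best-of-N sampling, one common form of TTS: draw several candidate answers from a policy and keep the one a scorer rates highest. This is an illustration only, not the paper's compute-optimal strategy. `toy_policy`, `toy_prm`, `best_of_n`, and `accuracy` are hypothetical names, the policy is a random stub rather than an LLM, and the scorer is a perfect verifier, which a real PRM is not.

```python
import random
from typing import Callable


def best_of_n(sample: Callable[[random.Random], str],
              score: Callable[[str], float],
              n: int,
              rng: random.Random) -> str:
    """Draw n candidates from the policy and return the highest-scoring one."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=score)


def toy_policy(rng: random.Random) -> str:
    # Hypothetical stand-in for an LLM: returns the correct answer "42"
    # about 30% of the time, otherwise a wrong number between 0 and 41.
    return "42" if rng.random() < 0.3 else str(rng.randint(0, 41))


def toy_prm(answer: str) -> float:
    # Hypothetical stand-in for a reward model. It is a perfect verifier
    # here, so the gains it shows are an upper bound for this toy setup.
    return 1.0 if answer == "42" else 0.0


def accuracy(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which Best-of-N picks the correct answer."""
    rng = random.Random(seed)
    hits = sum(best_of_n(toy_policy, toy_prm, n, rng) == "42"
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"N={n:2d}  accuracy={accuracy(n):.3f}")
```

In this toy setting accuracy rises as N grows, because more samples give the scorer more chances to find a correct answer. With a noisy scorer the gain would be smaller, which is one reason the choice of PRM matters when deciding how to spend a fixed compute budget.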
Problem

Research questions and friction points this paper is trying to address.

Optimize Test-Time Scaling for LLMs
Analyze influence of policy models, PRMs, problem difficulty
Enable smaller LLMs to outperform larger ones
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes Test-Time Scaling strategy
Enhances small LLMs' performance
Improves inference efficiency significantly
Runze Liu
Shanghai AI Laboratory, Tsinghua University
Junqi Gao
Shanghai AI Lab, Harbin Institute of Technology
Deep Learning, Generative Models, Continual Learning
Jian Zhao
BUPT
Kaiyan Zhang
Tsinghua University
Foundation Model, Collective Intelligence, Scientific Intelligence
Xiu Li
Bytedance Seed
Computer Vision, Computer Graphics, 3D Vision
Biqing Qi
Shanghai AI Laboratory
Wanli Ouyang
Shanghai AI Laboratory
Bowen Zhou
Shanghai AI Laboratory, Tsinghua University