s1: Simple test-time scaling

📅 2025-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Language models allocate their inference-time compute budget poorly, which limits their performance on complex mathematical reasoning tasks. Method: The authors propose *budget forcing*, a test-time technique that controls reasoning length: generation is forcefully terminated when the compute budget is exhausted, or extended by appending a "Wait" token when the model tries to stop, which often prompts it to double-check and correct its answer. This is paired with supervised fine-tuning of Qwen2.5-32B-Instruct on s1K, a compact, high-quality dataset of 1,000 questions with distilled reasoning traces. Contribution/Results: The resulting model, s1, exceeds o1-preview on competition math (MATH and AIME24) by up to 27%, and scaling with budget forcing extrapolates beyond its no-intervention performance, from 50% to 57% on AIME24, despite training on only 1,000 samples. All code, data, and models are publicly released.

📝 Abstract
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
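The decoding-time control described above can be sketched as a small loop. This is a minimal illustration, not the paper's released implementation: a toy `model` callable stands in for a real language model's next-token step, and the names (`budget_force`, `END_THINK`, `WAIT`) are illustrative assumptions.

```python
END_THINK = "</think>"  # delimiter the model emits to end its thinking phase (assumed)
WAIT = "Wait"           # token appended to force the model to keep reasoning

def budget_force(model, prompt, max_tokens=64, num_waits=1):
    """Sketch of budget forcing: cap or extend a model's thinking.

    - If the model tries to end its thinking before `num_waits` extensions
      have been used, suppress the end delimiter and append WAIT instead,
      nudging the model to double-check its reasoning.
    - If `max_tokens` is reached, the thinking process is forcefully
      terminated (the loop simply exits).
    """
    tokens, waits_used = [], 0
    while len(tokens) < max_tokens:
        tok = model(prompt, tokens)        # next-token step (toy stand-in)
        if tok == END_THINK:
            if waits_used < num_waits:
                waits_used += 1
                tokens.append(WAIT)        # suppress end, extend reasoning
                continue
            break                          # budget of extensions spent: let it stop
        tokens.append(tok)
    return tokens

def make_toy_model():
    """Scripted stand-in model: reasons, tries to stop, rechecks if nudged."""
    script = iter(["step1", "step2", END_THINK, "recheck", "answer: 42", END_THINK])
    return lambda prompt, tokens: next(script)

print(budget_force(make_toy_model(), "question", num_waits=1))
# With one forced "Wait", the toy model emits a verification step before stopping.
```

With `num_waits=0` the same toy model stops after its first attempt, and a small `max_tokens` truncates the trace mid-reasoning, mirroring the two directions of control the abstract describes.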
Problem

Research questions and friction points this paper is trying to address.

Language Model
Computational Budget
Mathematical Problem Solving
Innovation

Methods, ideas, or system contributions that make the work stand out.

budget forcing
test-time scaling
reasoning performance enhancement