First Finish Search: Efficient Test-Time Scaling in Large Language Models

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inefficient computational allocation during large language model (LLM) inference, this paper proposes a training-free, parallelized decoding strategy. It initiates *n* independent inference paths concurrently and adopts a “first-finish-first-return” policy—leveraging the empirical prior that shorter decoding paths are more likely to yield correct outputs—where the first path to terminate delivers the final prediction. The method requires no fine-tuning, avoids long-trajectory sampling or majority voting, and is theoretically grounded with provable efficacy and well-characterized applicability bounds. Integrated with parallel sampling, adaptive early termination, and test-time scaling (TTS), it boosts DeepSeek-R1’s accuracy on the AIME dataset by 15 percentage points to 82.23%, approaching the performance of o4-mini, while substantially reducing token consumption and end-to-end latency.
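The first-finish-first-return policy described above can be sketched as a small simulation. This is not the authors' implementation; `decode_path`, `first_finish_search`, and the fixed `TRACE_LENGTHS` are hypothetical stand-ins that model decoding time as proportional to trace length, so the shortest trace terminates first, as FFS assumes.

```python
import concurrent.futures
import time

# Simulated trace lengths (in tokens) for four independent decoding
# paths; a real run would stream tokens from the model instead.
TRACE_LENGTHS = [700, 250, 900, 400]

def decode_path(i):
    """Stand-in for one independent decoding run: decoding time is
    made proportional to trace length, so shorter traces finish first."""
    time.sleep(TRACE_LENGTHS[i] * 0.001)
    return f"answer-{i}", TRACE_LENGTHS[i]

def first_finish_search(n):
    """Launch n paths concurrently and return the first to terminate
    (the 'first-finish-first-return' policy)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(decode_path, i) for i in range(n)]
        done, _not_done = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        # Slower paths are discarded; their output is never used.
        for f in _not_done:
            f.cancel()
        return next(iter(done)).result()
```

In a real serving stack, the cancellation step is what saves tokens and latency: the remaining *n − 1* decoding paths are aborted the moment the first one emits its end-of-sequence token.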

📝 Abstract
Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches *n* independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
Problem

Research questions and friction points this paper is trying to address.

Dynamic compute allocation for efficient LLM inference
Reducing token usage and latency in reasoning tasks
Improving accuracy by stopping at the shortest completed trace
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic compute allocation for efficient inference
Training-free parallel decoding with early stopping
Shorter traces prioritized for higher accuracy
Aradhye Agarwal
Indian Institute of Technology Delhi

Ayan Sengupta
Indian Institute of Technology Delhi
Natural Language Processing · Meta Learning · Reinforcement Learning

Tanmoy Chakraborty
Indian Institute of Technology Delhi