🤖 AI Summary
To address high computational redundancy and the lack of efficient, adaptive test-time control mechanisms for large language models (LLMs) on reasoning-intensive tasks, this paper proposes a momentum-uncertainty-guided inference scheduling method. The approach introduces a gamma-control hyperparameter and dynamically allocates inference resources by leveraging stepwise uncertainty accumulation together with a physics-inspired momentum mechanism; it requires no additional training while ensuring stable, low-bias test-time scaling control. The paper provides theoretical guarantees on convergence and a favorable bias-variance trade-off. Empirically, the method reduces average computational cost by over 50% across multiple challenging reasoning benchmarks while simultaneously improving accuracy by 0.62–3.37 percentage points, significantly outperforming existing test-time scaling strategies.
📝 Abstract
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proofs supporting the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using recent Qwen3 models of different sizes (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62–3.37%.
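The momentum-style aggregation of stepwise uncertainty described above can be sketched as a small Python routine. This is a minimal illustration, not the paper's implementation: it assumes stepwise uncertainty is estimated as mean negative token log-probability, that momentum is an exponential moving average weighted by a decay `gamma` (the single hyperparameter of gamma-control), and that extra thinking budget is triggered when a step's uncertainty exceeds the accumulated momentum. The function names and the trigger rule are illustrative assumptions.

```python
def token_uncertainty(logprobs):
    """Uncertainty of one reasoning step, taken here as the mean
    negative log-probability of its tokens (a common proxy; the
    paper's exact estimator may differ)."""
    return -sum(logprobs) / len(logprobs)

def momentum_update(momentum, step_uncertainty, gamma=0.9):
    """Exponential-moving-average aggregation of stepwise uncertainty,
    analogous to momentum in physics: accumulated uncertainty decays
    by gamma while the newest step contributes (1 - gamma)."""
    return gamma * momentum + (1.0 - gamma) * step_uncertainty

def should_scale(momentum, step_uncertainty):
    """Illustrative trigger: allocate extra thinking budget only when
    the current step is more uncertain than the running momentum,
    i.e. it looks like a critical step."""
    return step_uncertainty > momentum
```

Under this sketch, routine steps with low uncertainty keep the momentum low and proceed without extra computation, while a sudden spike in uncertainty exceeds the momentum and requests more thinking budget; larger `gamma` makes the trigger smoother and less reactive to single-step noise.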