MUR: Momentum Uncertainty guided Reasoning for Large Language Models

📅 2025-07-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational redundancy of large language models (LLMs) on reasoning-intensive tasks and the lack of efficient, adaptive test-time control mechanisms, this paper proposes a momentum-uncertainty-guided inference scheduling method. The approach introduces a gamma-control hyperparameter and dynamically allocates inference resources by leveraging step-wise uncertainty accumulation and a physics-inspired momentum mechanism, requiring no additional training while ensuring stable, low-bias control over test-time scaling. The authors provide theoretical guarantees on convergence and favorable bias-variance trade-offs. Empirically, the method reduces average computational cost by over 50% across multiple challenging reasoning benchmarks while improving accuracy by 0.62–3.37 percentage points, significantly outperforming existing test-time scaling strategies.

📝 Abstract
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
Problem

Research questions and friction points this paper is trying to address.

Optimizing reasoning efficiency in Large Language Models
Reducing redundant computations during test-time scaling
Dynamically allocating thinking budgets to critical reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic budget allocation via momentum uncertainty
Gamma-control for flexible inference tuning
Reduces computation while improving accuracy
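The core idea above can be sketched in a few lines: maintain an exponentially weighted "momentum" of step-wise uncertainty, and spend extra test-time compute only on steps whose uncertainty spikes above that running estimate. This is a minimal illustrative sketch, not the paper's implementation; the function names, the mean negative log-probability proxy for step uncertainty, and the exact trigger rule are assumptions.

```python
# Illustrative sketch of a momentum-uncertainty gate (all names are assumptions,
# not the paper's API).

def step_uncertainty(token_logprobs):
    """Mean negative log-probability of a step's tokens -- one common
    proxy for step-level uncertainty (an assumption here)."""
    return -sum(token_logprobs) / len(token_logprobs)

def should_scale(momentum, u_t, gamma=0.9):
    """Update the momentum uncertainty and decide whether the current
    reasoning step warrants extra test-time compute (e.g. re-sampling).

    momentum -- running exponentially weighted uncertainty (m_{t-1})
    u_t      -- uncertainty of the current step
    gamma    -- smoothing factor, standing in for the paper's
                gamma-control knob (the exact rule is an assumption)
    """
    new_momentum = gamma * momentum + (1.0 - gamma) * u_t
    # Allocate extra budget only when the step is unusually uncertain
    # relative to the recent history.
    trigger = u_t > new_momentum
    return new_momentum, trigger

# Example: a spike in step uncertainty fires the gate.
m = 0.5
decisions = []
for u in [0.40, 0.45, 1.20, 0.50]:
    m, fire = should_scale(m, u)
    decisions.append(fire)
```

With these toy numbers only the third step (uncertainty 1.20) exceeds its momentum estimate, so only that step would receive additional compute; smaller gamma makes the momentum track recent steps more closely and the gate fire less often on isolated spikes.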