🤖 AI Summary
To address high computational redundancy and the lack of efficient, adaptive test-time control mechanisms for large language models (LLMs) on reasoning-intensive tasks, this paper proposes a momentum-uncertainty-guided inference scheduling method. The approach introduces a gamma-control hyperparameter and dynamically allocates inference resources by leveraging stepwise uncertainty accumulation together with a physics-inspired momentum mechanism; it requires no additional training while ensuring stable, low-bias test-time scaling control. The paper provides theoretical guarantees on convergence and a favorable bias-variance trade-off. Empirically, the method reduces average computational cost by over 50% across multiple challenging reasoning benchmarks while simultaneously improving accuracy by 0.62–3.37 percentage points, significantly outperforming existing test-time scaling strategies.
📝 Abstract
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proofs supporting the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using recent Qwen3 models of different sizes (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62–3.37%.
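The momentum-style aggregation of stepwise uncertainty described above can be sketched as a small Python routine. This is a minimal illustration, not the paper's implementation: it assumes stepwise uncertainty is estimated as mean negative token log-probability, that momentum is an exponential moving average weighted by a decay `gamma` (the single hyperparameter of gamma-control), and that extra thinking budget is triggered when a step's uncertainty exceeds the accumulated momentum. The function names and the trigger rule are illustrative assumptions.

```python
def token_uncertainty(logprobs):
    """Uncertainty of one reasoning step, taken here as the mean
    negative log-probability of its tokens (a common proxy; the
    paper's exact estimator may differ)."""
    return -sum(logprobs) / len(logprobs)

def momentum_update(momentum, step_uncertainty, gamma=0.9):
    """Exponential-moving-average aggregation of stepwise uncertainty,
    analogous to momentum in physics: accumulated uncertainty decays
    by gamma while the newest step contributes (1 - gamma)."""
    return gamma * momentum + (1.0 - gamma) * step_uncertainty

def should_scale(momentum, step_uncertainty):
    """Illustrative trigger: allocate extra thinking budget only when
    the current step is more uncertain than the running momentum,
    i.e. it looks like a critical step."""
    return step_uncertainty > momentum
```

Under this sketch, routine steps with low uncertainty keep the momentum low and proceed without extra computation, while a sudden spike in uncertainty exceeds the momentum and requests more thinking budget; larger `gamma` makes the trigger smoother and less reactive to single-step noise.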