🤖 AI Summary
To address the high computational overhead of large reasoning models, which stems from excessive token generation even under short input prompts, this paper proposes a training-free dynamic sparse attention mechanism. The method introduces a unified token-ranking strategy across attention heads, integrating local head selections with context-aware recency information to enable efficient, accuracy-preserving token pruning during decoding. By exploiting global attention patterns shared across heads, it avoids the error accumulation of per-head independent pruning, thereby significantly mitigating accuracy degradation in long-range generation. Experiments across diverse reasoning tasks demonstrate that the approach maintains or improves accuracy while achieving a 1.1× average decoding speedup over full attention and a 1.13× end-to-end speedup over existing sparse attention methods. Notably, it retains original model performance while attending to only 50% of the tokens.
📄 Abstract
Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a $1.1\times$ average decoding speed-up compared to full attention. Moreover, LessIsMore attends to $2\times$ fewer tokens without accuracy loss, achieving a $1.13\times$ end-to-end speed-up compared to existing sparse attention methods.
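To make the unified cross-head selection concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. It assumes per-head attention scores at the current decoding step, aggregates them by summation into one global ranking (the paper's exact aggregation rule may differ), forces retention of a recency window, and returns a single token subset shared by all heads. The function name, tensor layout, and parameters (`keep_ratio`, `recency_window`) are all illustrative assumptions.

```python
import torch

def unified_token_selection(attn_scores: torch.Tensor,
                            keep_ratio: float = 0.5,
                            recency_window: int = 64) -> torch.Tensor:
    """Sketch of unified cross-head token ranking (illustrative, not the
    paper's code).

    attn_scores: [num_heads, seq_len] attention weights for the current
                 decoding step. Returns sorted indices of retained tokens,
                 shared by all heads.
    """
    num_heads, seq_len = attn_scores.shape
    budget = max(1, int(seq_len * keep_ratio))

    # Aggregate head-local scores into a single global ranking;
    # summation over heads is one plausible aggregation rule.
    global_scores = attn_scores.sum(dim=0)

    # Always retain the most recent tokens (recency / local context).
    recent = torch.arange(max(0, seq_len - recency_window), seq_len)
    global_scores[recent] = float("inf")

    # One unified top-k selection instead of a separate subset per head.
    keep = torch.topk(global_scores, k=budget).indices
    return torch.sort(keep).values

# Usage: 8 heads over a 1024-token KV cache, keeping 50% of tokens.
scores = torch.softmax(torch.randn(8, 1024), dim=-1)
keep = unified_token_selection(scores)  # one index set for every head
```

Because every head attends over the same retained subset, the KV cache can be gathered once per layer rather than once per head, which is one way such a unified selection can translate into the reported decoding speedups.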