RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

📅 2025-06-18
🤖 AI Summary
Local-global attention models face a Pareto trade-off in window size: large windows preserve full-attention-level performance but offer little efficiency gain in short-context settings, while small windows improve efficiency at the cost of degraded quality. RATTENTION pairs sliding-window local attention with a dedicated linear attention module that captures information from tokens outside the window, so long-range dependencies are still modeled even at very small window sizes (e.g., 512 tokens). In pretraining experiments at the 3B and 12B scales, RATTENTION with a 512-token window consistently matches full-attention models, and its recurrent linear-attention component improves long-context performance on the RULER benchmark. Thanks to a specialized kernel implementation and the reduced window size, these gains come without sacrificing training speed, shifting the efficiency-performance Pareto frontier in short-context regimes.

📝 Abstract
Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
Problem

Research questions and friction points this paper is trying to address.

Optimizing window size in local-global attention models for efficiency.
Addressing performance degradation in small-window local attention models.
Enhancing long-context performance without sacrificing training efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines sliding-window local attention with a linear attention module that summarizes out-of-window tokens.
Reduces the window size to just 512 tokens while matching full-attention performance.
Preserves training efficiency via a specialized kernel implementation and the smaller window.
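The core idea above — local softmax attention inside the window plus a recurrent linear-attention state over tokens that have fallen out of it — can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's actual design: the feature map `phi`, the simple additive combination of the two branches, and the per-step update schedule are all assumptions made for clarity.

```python
import numpy as np

def rattention_sketch(q, k, v, window=512, eps=1e-6):
    """Illustrative sketch of a local-global hybrid: sliding-window
    softmax attention over the last `window` tokens, plus a linear
    attention readout over all earlier (out-of-window) tokens.
    The positive feature map and additive merge are assumptions."""
    T, d = q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-3  # assumed positive feature map
    S = np.zeros((d, d))   # running sum of outer(phi(k_i), v_i) for evicted tokens
    z = np.zeros(d)        # running sum of phi(k_i) for the normalizer
    out = np.zeros((T, d))
    for t in range(T):
        lo = max(0, t - window + 1)
        if lo > 0:
            # the single token that just left the window enters the linear state
            idx = lo - 1
            S += np.outer(phi(k[idx]), v[idx])
            z += phi(k[idx])
        # local causal softmax attention over in-window tokens
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        local = w @ v[lo:t + 1]
        # linear-attention readout for out-of-window tokens
        if lo > 0:
            global_part = (phi(q[t]) @ S) / (phi(q[t]) @ z + eps)
        else:
            global_part = np.zeros(d)
        out[t] = local + global_part
    return out
```

With `window >= T` the linear state is never populated and the function reduces to plain causal softmax attention, which is the limiting case the paper's full-attention baseline represents; shrinking `window` trades softmax coverage for the compressed recurrent summary.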