🤖 AI Summary
This work investigates how the sliding-window length affects the joint modeling of short- and long-term context by softmax attention and linear RNN (xLSTM) layers. Challenging the common assumption that larger windows are needed to capture long-range dependencies, we propose SWAX, a hybrid architecture that couples sliding-window attention with xLSTM layers and is trained with stochastically varying window sizes. Empirically, short attention windows do not impair xLSTM's long-term memory; instead, they improve it by pushing the model to rely on the recurrent state rather than on attention for long-range retrieval. Because very short windows hurt short-context tasks, stochastic window-size training forces the model to exploit both a longer attention window and the xLSTM memory, and SWAX outperforms standard sliding-window attention baselines on both short- and long-context benchmarks. These results point to a complementary interplay between attention windowing and recurrent memory in hybrid sequence models.
📝 Abstract
Recent work shows that hybrid architectures combining sliding-window softmax attention layers with linear recurrent neural network (RNN) layers outperform either of these architectures on its own. However, the impact of the window length, and the interplay between the softmax attention and linear RNN layers, remain understudied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers.
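To make the layer layout concrete, the sketch below shows one way such a hybrid block could be assembled: a causal sliding-window self-attention sub-layer followed by a recurrent sub-layer. This is only an illustrative reading of the abstract, not the authors' implementation; all module names and hyperparameters are assumptions, and `nn.LSTM` stands in for the xLSTM linear-RNN layer used in SWAX.

```python
# Illustrative sketch only: a hybrid block alternating sliding-window causal
# self-attention with a recurrent layer. SWAX uses xLSTM (linear RNN) layers;
# here nn.LSTM is a stand-in placeholder, and all names are hypothetical.
from typing import Optional

import torch
import torch.nn as nn


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j only if i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)


class SlidingWindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor, window: Optional[int] = None) -> torch.Tensor:
        w = self.window if window is None else window
        seq_len = x.size(1)
        # For nn.MultiheadAttention, True in attn_mask means "not allowed to
        # attend", hence the negation of the allowed-positions mask.
        mask = ~sliding_window_causal_mask(seq_len, w).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


class HybridBlock(nn.Module):
    """One sliding-window attention sub-layer followed by one recurrent sub-layer."""

    def __init__(self, dim: int, num_heads: int, window: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = SlidingWindowAttention(dim, num_heads, window)
        self.norm2 = nn.LayerNorm(dim)
        # Stand-in for the xLSTM layer; carries context beyond the window.
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor, window: Optional[int] = None) -> torch.Tensor:
        x = x + self.attn(self.norm1(x), window)
        rnn_out, _ = self.rnn(self.norm2(x))
        return x + rnn_out


# Toy usage: the window can be overridden per forward call, which is what a
# stochastic-window training loop (see below) would rely on.
block = HybridBlock(dim=64, num_heads=4, window=128)
x = torch.randn(2, 256, 64)
y = block(x, window=32)
```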
A counter-intuitive finding with SWAX is that larger sliding windows do not improve long-context performance. In fact, short-window attention encourages the model to better train the xLSTM's long-term memory, since it relies less on the softmax attention mechanism for long-context retrieval.
The downside of small sliding windows is that they hurt short-context tasks, which could otherwise be solved with the information available in a moderately larger window. We therefore train SWAX by stochastically varying the sliding-window size, forcing the model to leverage both a longer attention window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular sliding-window attention on both short- and long-context problems.
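A minimal sketch of this training scheme, under assumed details the abstract does not specify (the sampling distribution, the candidate window sizes, and the model's `window` keyword argument are all hypothetical): the sliding-window size is resampled at every optimization step, so the model sometimes sees long windows (helping short-context tasks) and sometimes short ones (exercising the recurrent memory).

```python
# Sketch of stochastic-window training, not the authors' exact recipe.
# WINDOW_CHOICES and the uniform sampling are assumptions for illustration;
# `model` is any network accepting a per-call window size, e.g. a stack of
# HybridBlock modules from the sketch above plus an embedding and output head.
import random

WINDOW_CHOICES = [128, 512, 2048]  # hypothetical candidate window sizes


def training_step(model, batch, optimizer, loss_fn):
    window = random.choice(WINDOW_CHOICES)  # resampled at every step
    logits = model(batch["input_ids"], window=window)
    loss = loss_fn(logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), window
```

Resampling per step (rather than fixing one window per run) is what forces a single set of weights to handle both regimes, which is the stated motivation for the stochastic-window training in SWAX.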