🤖 AI Summary
Full-attention (FA) pretrained large language models degrade severely on long-context reasoning when sliding-window attention (SWA) is naively enabled at inference time, owing to the training-inference mismatch. To address this, the paper proposes, for the first time, an SWA-adaptation framework that requires no re-pretraining. The method combines five synergistic strategies: SWA applied only during the prefill phase, sink token retention, interleaved FA/SWA layers, chain-of-thought-guided prompting, and lightweight supervised fine-tuning, enabling plug-and-play adaptation of existing FA models. Evaluated on multiple long-text benchmarks, the adapted models effectively recover the original FA performance while reducing inference memory consumption by 40% and latency by 35%. This work delivers a scalable, hardware- and precision-agnostic solution for efficient long-context deployment of large language models.
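The masking idea behind the first two strategies (prefill-time SWA with retained sink tokens) can be sketched in a few lines. The snippet below is a minimal, hypothetical NumPy illustration, not code from the paper's repository; the function name and parameters are assumptions for the example. Each query token attends to a fixed-size window of recent tokens plus a few always-visible "sink" tokens at the start of the sequence.

```python
import numpy as np

def swa_mask_with_sinks(seq_len: int, window: int, num_sinks: int) -> np.ndarray:
    """Boolean attention mask: entry [i, j] is True if query i may attend to key j.

    Each query sees (a) itself and the previous `window - 1` tokens (sliding
    window) and (b) the first `num_sinks` tokens (sink tokens), never any
    future token.
    """
    q = np.arange(seq_len)[:, None]  # query positions
    k = np.arange(seq_len)[None, :]  # key positions
    causal = k <= q                  # no attention to future tokens
    in_window = (q - k) < window     # within the sliding window
    is_sink = k < num_sinks          # sink tokens stay visible to every query
    return causal & (in_window | is_sink)

# Toy example: 8 tokens, a window of 3, and 1 retained sink token.
print(swa_mask_with_sinks(seq_len=8, window=3, num_sinks=1).astype(int))
```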
📝 Abstract
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling SWA everywhere at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This raises the question: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT) prompting; and (5) fine-tuning. Our experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
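Recipes (1) and (3) amount to a small dispatch policy over layers and inference phases. The sketch below is a hypothetical illustration of that idea, not the configuration used in the paper or its repository: the helper names, the one-FA-layer-in-four interleaving ratio, and the window size are all assumed for the example.

```python
from typing import List, Optional

def per_layer_window(num_layers: int, window: int, fa_every: int = 4) -> List[Optional[int]]:
    """Interleave layer types (recipe 3): keep full attention (window=None) on
    every `fa_every`-th layer and use sliding-window attention elsewhere."""
    return [None if i % fa_every == 0 else window for i in range(num_layers)]

def effective_window(layer_window: Optional[int], phase: str) -> Optional[int]:
    """Apply SWA only while prefilling (recipe 1); decoding falls back to
    full attention over the cached context."""
    if phase == "decode" or layer_window is None:
        return None           # full attention
    return layer_window       # sliding-window attention during prefill

# Toy example: an 8-layer model with a 1024-token window.
plan = per_layer_window(num_layers=8, window=1024)
print(plan)                                                  # per-layer plan
print([effective_window(w, phase="prefill") for w in plan])  # windows during prefill
print([effective_window(w, phase="decode") for w in plan])   # full attention during decode
```

In such a setup, the returned window size would parameterize each layer's attention mask (for instance, a mask like the sliding-window-with-sinks sketch above), while decoding and the designated FA layers keep unrestricted causal attention.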