Sliding Window Attention Adaptation

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the long-context performance degradation that arises when full-attention (FA) pretrained large language models are switched to sliding-window attention (SWA), this paper proposes a SWA-adaptation framework that requires no re-pretraining. The method combines five synergistic strategies: applying SWA only during the prefill phase, retaining "sink" tokens, interleaving FA and SWA layers, chain-of-thought prompting, and lightweight supervised fine-tuning, enabling plug-and-play adaptation of existing FA models. Evaluated on multiple long-context benchmarks, the adapted models recover the original long-context performance while reducing inference memory consumption by 40% and latency by 35%. This work delivers a scalable, practical solution for efficient long-context deployment of large language models.

📝 Abstract
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
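The first two recipes in the abstract, a sliding attention window plus preserved "sink" tokens, can be expressed as a boolean attention mask. A minimal sketch follows; the window size and sink count are illustrative assumptions, not values recommended by the paper.

```python
# Sketch: causal attention mask combining a sliding window with
# retained "sink" tokens. Window size and sink count below are
# illustrative, not the paper's recommended settings.
import numpy as np

def swa_mask(seq_len: int, window: int, num_sinks: int) -> np.ndarray:
    """True where query position i may attend to key position j."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q                 # no attending to future tokens
    in_window = (q - k) < window    # keep only the most recent `window` keys
    is_sink = k < num_sinks         # always keep the first few tokens
    return causal & (in_window | is_sink)

mask = swa_mask(seq_len=8, window=3, num_sinks=1)
print(mask.astype(int))
```

Because each query attends to at most `window + num_sinks` keys, the per-token cost stays constant as the context grows, which is where the linear overall complexity comes from.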
Problem

Research questions and friction points this paper is trying to address.

Adapts full-attention LLMs to sliding window attention efficiently
Reduces quadratic complexity to linear for long-context inference
Mitigates performance degradation from training-inference mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts full-attention models to sliding-window attention without re-pretraining
Combines five methods for effective adaptation
Recovers long-context performance with linear complexity
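The linear-versus-quadratic claim can be checked by counting the query-key pairs scored during prefill. A small sketch (ignoring heads and constant factors; the window size is an illustrative assumption):

```python
# Sketch: prefill attention cost measured as the number of
# query-key pairs scored. FA grows quadratically with sequence
# length n; SWA grows linearly once n exceeds the window.
def fa_pairs(n: int) -> int:
    """Causal full attention: query i attends to i + 1 keys."""
    return n * (n + 1) // 2

def swa_pairs(n: int, window: int) -> int:
    """Sliding window: each query attends to at most `window` keys."""
    return sum(min(i + 1, window) for i in range(n))

# Doubling the context roughly quadruples FA work but only
# roughly doubles SWA work (window = 128 is illustrative).
print(fa_pairs(2000) / fa_pairs(1000))        # ~4x
print(swa_pairs(2000, 128) / swa_pairs(1000, 128))  # ~2x
```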
Yijiong Yu
Master's Student, Tsinghua University
Natural Language Processing · Machine Learning
Jiale Liu
Penn State University
Qingyun Wu
The Pennsylvania State University
Agentic AI
Huazheng Wang
Oregon State University
Ji Pei
DeepSolution