🤖 AI Summary
Full-attention (FA) pretrained large language models degrade severely on long-context reasoning when sliding-window attention (SWA) is naively enabled at inference time, owing to the training-inference mismatch. To address this, the paper proposes, for the first time, an SWA-adaptation framework that requires no re-pretraining. The method combines five synergistic strategies: SWA applied only during the prefill phase, sink token retention, interleaved FA/SWA layers, chain-of-thought-guided prompting, and lightweight supervised fine-tuning, enabling plug-and-play adaptation of existing FA models. Evaluated on multiple long-text benchmarks, the adapted models effectively recover the original FA performance while reducing inference memory consumption by 40% and latency by 35%. This work delivers a scalable, hardware- and precision-agnostic solution for efficient long-context deployment of large language models.
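The masking idea behind the first two strategies (prefill-time SWA with retained sink tokens) can be sketched in a few lines. The snippet below is a minimal, hypothetical NumPy illustration, not code from the paper's repository; the function name and parameters are assumptions for the example. Each query token attends to a fixed-size window of recent tokens plus a few always-visible "sink" tokens at the start of the sequence.

```python
import numpy as np

def swa_mask_with_sinks(seq_len: int, window: int, num_sinks: int) -> np.ndarray:
    """Boolean attention mask: entry [i, j] is True if query i may attend to key j.

    Each query sees (a) itself and the previous `window - 1` tokens (sliding
    window) and (b) the first `num_sinks` tokens (sink tokens), never any
    future token.
    """
    q = np.arange(seq_len)[:, None]  # query positions
    k = np.arange(seq_len)[None, :]  # key positions
    causal = k <= q                  # no attention to future tokens
    in_window = (q - k) < window     # within the sliding window
    is_sink = k < num_sinks          # sink tokens stay visible to every query
    return causal & (in_window | is_sink)

# Toy example: 8 tokens, a window of 3, and 1 retained sink token.
print(swa_mask_with_sinks(seq_len=8, window=3, num_sinks=1).astype(int))
```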
📝 Abstract
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling SWA everywhere at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This raises the question: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT) prompting; and (5) fine-tuning. Our experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
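Recipes (1) and (3) amount to a small dispatch policy over layers and inference phases. The sketch below is a hypothetical illustration of that idea, not the configuration used in the paper or its repository: the helper names, the one-FA-layer-in-four interleaving ratio, and the window size are all assumed for the example.

```python
from typing import List, Optional

def per_layer_window(num_layers: int, window: int, fa_every: int = 4) -> List[Optional[int]]:
    """Interleave layer types (recipe 3): keep full attention (window=None) on
    every `fa_every`-th layer and use sliding-window attention elsewhere."""
    return [None if i % fa_every == 0 else window for i in range(num_layers)]

def effective_window(layer_window: Optional[int], phase: str) -> Optional[int]:
    """Apply SWA only while prefilling (recipe 1); decoding falls back to
    full attention over the cached context."""
    if phase == "decode" or layer_window is None:
        return None           # full attention
    return layer_window       # sliding-window attention during prefill

# Toy example: an 8-layer model with a 1024-token window.
plan = per_layer_window(num_layers=8, window=1024)
print(plan)                                                  # per-layer plan
print([effective_window(w, phase="prefill") for w in plan])  # windows during prefill
print([effective_window(w, phase="decode") for w in plan])   # full attention during decode
```

In such a setup, the returned window size would parameterize each layer's attention mask (for instance, a mask like the sliding-window-with-sinks sketch above), while decoding and the designated FA layers keep unrestricted causal attention.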