SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

📅 2025-04-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Language models extrapolate poorly to sequences significantly longer than their training context and typically require costly long-context fine-tuning. Method: This paper proposes SWAN-GPT, an efficient decoder-only architecture that enables robust length extrapolation without long-sequence fine-tuning. Its core design interleaves NoPE layers (no positional encoding, global attention) with SWA-RoPE layers (sliding-window attention equipped with rotary positional encodings), coupled with dynamic scaling of attention scores at inference time. The sliding-window layers, together with a lightweight conversion path from standard architectures, substantially reduce training overhead. Contribution/Results: SWAN-GPT maintains strong performance on sequences 4–8× longer than its training length, achieves significant inference throughput gains, and can be instantiated efficiently from standard GPT checkpoints via minimal continued pretraining. It demonstrates strong generalization, computational efficiency, and scalability across diverse sequence lengths.
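The interleaving described above can be sketched as a simple layer-plan function. This is an illustrative sketch only: the paper specifies that NoPE and SWA-RoPE layers alternate, but the exact ratio and the function/parameter names here (`swan_layer_plan`, `nope_every`) are assumptions, not the authors' implementation.

```python
def swan_layer_plan(num_layers: int, nope_every: int = 4) -> list[str]:
    """Assign a type to each decoder layer in a SWAN-style stack.

    Every `nope_every`-th layer is a global-attention NoPE layer
    (no positional encoding); the rest are local SWA-RoPE layers
    (sliding-window attention with rotary positional encodings).
    The 1:3 ratio used here is a placeholder assumption.
    """
    plan = []
    for i in range(num_layers):
        if i % nope_every == 0:
            plan.append("NoPE")      # global attention, no positional encoding
        else:
            plan.append("SWA-RoPE")  # windowed attention + rotary encodings
    return plan


# e.g. an 8-layer stack: NoPE, then three SWA-RoPE layers, repeated
print(swan_layer_plan(8))
```

Because only the NoPE layers attend globally, most layers run windowed attention with cost linear in sequence length, which is where the training and throughput savings come from.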

📝 Abstract
We present a decoder-only Transformer architecture that robustly generalizes to sequence lengths substantially longer than those seen during training. Our model, SWAN-GPT, interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE). Experiments demonstrate strong performance on sequence lengths significantly longer than the training length without the need for additional long-context training. This robust length extrapolation is achieved through our novel architecture, enhanced by a straightforward dynamic scaling of attention scores during inference. In addition, SWAN-GPT is more computationally efficient than standard GPT architectures, resulting in cheaper training and higher throughput. Further, we demonstrate that existing pre-trained decoder-only models can be efficiently converted to the SWAN architecture with minimal continued training, enabling longer contexts. Overall, our work presents an effective approach for scaling language models to longer contexts in a robust and efficient manner.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-context generalization in Transformer models
Improving computational efficiency for long sequences
Enabling conversion of pre-trained models to longer contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only Transformer with NoPE and SWA-RoPE layers
Dynamic scaling of attention scores during inference
Efficient conversion of pre-trained models to SWAN architecture
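The dynamic attention-score scaling listed above can be illustrated with a length-dependent temperature on the usual 1/√d attention scale. The log-ratio form below is a common entropy-stabilizing choice and an assumption here; the paper's exact formula may differ, and all names (`dynamic_attn_scale`, `train_len`) are hypothetical.

```python
import math

def dynamic_attn_scale(seq_len: int, train_len: int = 4096,
                       head_dim: int = 64) -> float:
    """Illustrative inference-time attention scaling.

    Starts from the standard 1/sqrt(head_dim) factor and multiplies it
    by a log-ratio term once the context exceeds the training length,
    so attention logits sharpen as more keys compete for probability
    mass. Clamped to 1.0 below the training length.
    """
    base = 1.0 / math.sqrt(head_dim)
    factor = max(1.0, math.log(seq_len) / math.log(train_len))
    return base * factor
```

At the training length the scale reduces to the standard value, and it grows slowly (logarithmically) beyond it, which is what makes the adjustment safe to apply only at inference time without retraining.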