🤖 AI Summary
To address the significant performance degradation of Rotary Position Embedding (RoPE) and similar positional encodings under ultra-long sequences, this paper systematically analyzes attention distribution differences among RoPE, NoPE, and QK-Norm in long-context settings. We propose, for the first time, a dynamic switching-and-fusion hybrid attention mechanism that adaptively integrates RoPE’s local inductive bias with NoPE’s global modeling capability, while incorporating QK-Norm to stabilize training. Our approach includes RoPE variant tuning, an embedding-free NoPE design, and long-context data augmentation. Empirical evaluation demonstrates substantial improvements over mainstream RoPE-based models on long-text benchmarks—including LongBench and SCROLLS—while maintaining state-of-the-art performance on standard short-context tasks such as MMLU and ARC. This unified framework advances both long- and short-context modeling capabilities within a single architecture.
📝 Abstract
Long-context large language models (LLMs) have achieved remarkable advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et al., 2023). By adjusting RoPE parameters and incorporating training data with extended contexts, we can train performant models with considerably longer input sequences. However, existing RoPE-based methods exhibit performance limitations when applied to extended context lengths. This paper presents a comprehensive analysis of various attention mechanisms, including RoPE, No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying their strengths and shortcomings in long-context modeling. Our investigation identifies distinctive attention patterns in these methods and highlights their impact on long-context performance, providing valuable insights for architectural design. Building on these findings, we propose a novel architecture based on a hybrid attention mechanism that not only surpasses conventional RoPE-based transformer models in long-context tasks but also achieves competitive performance on benchmarks requiring shorter context lengths.
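To make the three attention variants compared in the abstract concrete, the sketch below contrasts how each one produces pre-softmax attention logits: RoPE rotates query/key pairs by position-dependent angles, NoPE applies no positional signal at all, and QK-Norm L2-normalizes queries and keys before the dot product. This is a minimal single-head NumPy illustration, not the paper's implementation; the function names and the simplified pairing of dimensions are assumptions for exposition.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply a simplified Rotary Position Embedding to x of shape (seq, dim).

    Each (x1_i, x2_i) pair of channels is rotated by an angle that grows
    with the token position, injecting relative-position information into
    the subsequent dot product. (Illustrative layout, not the paper's.)"""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def attention_scores(q, k, mode="rope"):
    """Pre-softmax attention logits under the three schemes discussed above."""
    seq, dim = q.shape
    pos = np.arange(seq, dtype=np.float64)
    if mode == "rope":
        # Position enters only through the rotation of q and k.
        q, k = rope_rotate(q, pos), rope_rotate(k, pos)
    elif mode == "qknorm":
        # Normalize q and k; logits become bounded cosine similarities,
        # which stabilizes training but carries no positional signal.
        q = q / np.linalg.norm(q, axis=-1, keepdims=True)
        k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    # mode == "nope": plain dot-product attention, no positional encoding.
    return q @ k.T / np.sqrt(dim)
```

A hybrid mechanism of the kind the paper proposes would combine such heads, e.g. mixing RoPE heads (local inductive bias) with NoPE heads (position-agnostic global retrieval); the exact gating is beyond this sketch.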