🤖 AI Summary
RoPE introduces a distance-dependent intrinsic bias into attention scores, degrading long-context modeling; existing extension methods typically rely on post-pretraining rescaling or hyperparameter tuning. This paper theoretically characterizes this bias mechanism and proposes Token-Aware Phase Attention (TAPA): a lightweight, learnable phase function embedded directly in the attention computation to enable dynamic position awareness. TAPA requires neither full retraining nor manual hyperparameter adjustment and naturally extrapolates to unseen sequence lengths, enabling context extension with only light fine-tuning. Empirical evaluation across multiple long-context benchmarks shows that TAPA significantly reduces perplexity and consistently outperforms RoPE and its major variants, validating its effectiveness and generality in both context-length extension and length-extrapolation scenarios.
📝 Abstract
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long contexts than the RoPE family of methods.
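The abstract describes incorporating a learnable phase function into the attention score, but does not give the exact parameterization. The sketch below is a hypothetical minimal illustration, not the paper's method: it assumes the phase term is a learned cosine modulation of relative token distance applied to standard scaled dot-product logits, with `phase_weights` and `phase_bias` standing in for whatever learnable parameters TAPA actually uses.

```python
import numpy as np

def phase_attention_scores(q, k, phase_weights, phase_bias):
    """Toy attention logits with a learnable phase term.

    Hypothetical sketch: the phase function is assumed to be a
    learned cosine of relative distance, which is NOT necessarily
    the parameterization used in the TAPA paper.
    """
    T, d = q.shape
    # standard scaled dot-product logits
    logits = q @ k.T / np.sqrt(d)
    # signed relative distances between query and key positions
    rel = np.arange(T)[:, None] - np.arange(T)[None, :]
    # learnable phase: cos(w * |distance| + b) modulates each logit,
    # so the position dependence is trained rather than fixed as in RoPE
    phase = np.cos(phase_weights * np.abs(rel) + phase_bias)
    return logits * phase

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
scores = phase_attention_scores(q, k, phase_weights=0.1, phase_bias=0.0)
print(scores.shape)  # (4, 4)
```

With `phase_weights=0` and `phase_bias=0` the phase factor is identically 1 and the sketch reduces to plain dot-product attention, which is one way such a scheme could be initialized to start from the unmodified attention pattern.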