Positional Encoding via Token-Aware Phase Attention

📅 2025-09-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
RoPE introduces a distance-dependent intrinsic bias in attention scores, degrading long-context modeling capability; existing extension methods typically rely on post-pretraining rescaling or hyperparameter tuning. This paper theoretically characterizes this bias mechanism and proposes Token-Aware Phase Attention (TAPA): a lightweight, learnable phase function embedded directly into the attention computation to enable dynamic position awareness. TAPA requires neither full retraining nor manual hyperparameter adjustment and extrapolates naturally to unseen sequence lengths, enabling context extension with only light fine-tuning. Empirical evaluation across multiple long-context benchmarks shows that TAPA significantly reduces perplexity and consistently outperforms RoPE and its major variants. The results support TAPA's effectiveness and generalizability in both context-length extension and length-extrapolation settings.

📝 Abstract
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context tasks than the RoPE family.
Problem

Research questions and friction points this paper is trying to address.

Addresses RoPE's distance-dependent bias in attention
Introduces TAPA for better long-context modeling
Enables extrapolation to unseen sequence lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable phase function in attention mechanism
Direct light fine-tuning for longer contexts
Extrapolates to unseen lengths with lower perplexity
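The paper's exact formulation of TAPA is not reproduced on this page, but the core idea stated above (a learnable phase function modulating attention scores by relative position) can be illustrated with a minimal sketch. Everything below is hypothetical: the function `phase_attention` and the per-head parameters `omega` and `phi` are illustrative stand-ins, not the paper's actual method.

```python
import numpy as np

def phase_attention(Q, K, V, omega, phi):
    """Causal attention with a learnable phase modulation of the logits.

    Hypothetical sketch (not TAPA's actual formulation): each query-key
    logit is rescaled by cos(omega * (i - j) + phi), where (i - j) is the
    relative distance. In a real model, omega and phi would be trainable
    scalars (or per-head parameters) learned jointly with the weights.
    """
    T, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)                         # (T, T) scaled dot products
    dist = np.arange(T)[:, None] - np.arange(T)[None, :]  # relative distance i - j
    logits = logits * np.cos(omega * dist + phi)          # phase-modulated scores

    # causal mask, then softmax over each row
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    logits = np.where(mask, -np.inf, logits)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because the phase term depends only on relative distance, it applies unchanged at sequence lengths never seen in training, which is the property the bullets above describe as extrapolation to unseen lengths.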