🤖 AI Summary
RoPE introduces a distance-dependent intrinsic bias into attention scores, degrading long-context modeling; existing extension methods typically rely on post-pretraining rescaling or hyperparameter tuning. This paper theoretically characterizes this bias mechanism and proposes Token-Aware Phase Attention (TAPA): a lightweight, learnable phase function embedded directly in the attention computation to enable dynamic position awareness. TAPA requires neither full retraining nor manual hyperparameter adjustment and naturally extrapolates to unseen sequence lengths, enabling context extension with only light fine-tuning. Empirical evaluation across multiple long-context benchmarks shows that TAPA significantly reduces perplexity and consistently outperforms RoPE and its major variants, validating its effectiveness and generality in both context-length extension and length-extrapolation scenarios.
📝 Abstract
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long contexts than the RoPE family of methods.
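The abstract describes incorporating a learnable phase function into the attention score, but does not give the exact parameterization. The sketch below is a hypothetical minimal illustration, not the paper's method: it assumes the phase term is a learned cosine modulation of relative token distance applied to standard scaled dot-product logits, with `phase_weights` and `phase_bias` standing in for whatever learnable parameters TAPA actually uses.

```python
import numpy as np

def phase_attention_scores(q, k, phase_weights, phase_bias):
    """Toy attention logits with a learnable phase term.

    Hypothetical sketch: the phase function is assumed to be a
    learned cosine of relative distance, which is NOT necessarily
    the parameterization used in the TAPA paper.
    """
    T, d = q.shape
    # standard scaled dot-product logits
    logits = q @ k.T / np.sqrt(d)
    # signed relative distances between query and key positions
    rel = np.arange(T)[:, None] - np.arange(T)[None, :]
    # learnable phase: cos(w * |distance| + b) modulates each logit,
    # so the position dependence is trained rather than fixed as in RoPE
    phase = np.cos(phase_weights * np.abs(rel) + phase_bias)
    return logits * phase

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
scores = phase_attention_scores(q, k, phase_weights=0.1, phase_bias=0.0)
print(scores.shape)  # (4, 4)
```

With `phase_weights=0` and `phase_bias=0` the phase factor is identically 1 and the sketch reduces to plain dot-product attention, which is one way such a scheme could be initialized to start from the unmodified attention pattern.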