Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

📅 2024-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
RoPE exhibits insufficient spectral stability in long-context modeling, leading to periodic degradation of attention and poor length generalization. This work, for the first time, systematically uncovers—through the lens of discrete signal processing—the spectral distortion mechanisms induced by linear layers, activation functions, and sequence truncation. To address this, we propose Fourier Positional Encoding (FoPE): a novel positional encoding that explicitly nullifies destructive frequency components within a Fourier series framework, enabling robust periodic extension of sequences. FoPE is grounded in the Discrete Fourier Transform (DFT), integrating frequency-domain filtering with series reconstruction, and maintains full compatibility with standard Transformer architectures without requiring modifications to training procedures. Experiments demonstrate that FoPE significantly improves accuracy stability on long-range retrieval tasks (e.g., needle-in-a-haystack) and reduces multi-scale model perplexity fluctuations across context windows by 37%, outperforming both RoPE and ALiBi.

Technology Category

Application Category

📝 Abstract
Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs Fourier Series and zero-outs the destructive frequency components, increasing model robustness against the spectrum damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several analyses and ablations bring further support to our method and theoretical modeling.
Problem

Research questions and friction points this paper is trying to address.

Rotary Positional Embeddings
Long Sentences Processing
Stability Issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fourier Positional Embedding
Long Sequence Understanding
Signal Processing Theory
🔎 Similar Papers
No similar papers found.