Wavelet-based Positional Representation for Long Context

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional positional encodings break down when large language models extrapolate to ultra-long sequences: RoPE suffers from fixed-scale limitations, while ALiBi constrains the receptive field and lacks unconstrained extrapolation capability. This paper introduces continuous wavelet transform (CWT) into positional encoding for the first time, proposing Multi-scale Wavelet Positional Encoding (MWPE). MWPE employs learnable Morlet wavelet bases for adaptive multi-scale modeling, capturing both fine-grained local structure and global long-range dependencies, and it inherently supports arbitrary-length extrapolation without attention truncation. Fully compatible with standard Transformer architectures, MWPE consistently outperforms RoPE and ALiBi across both short- and long-context tasks; on extrapolation benchmarks exceeding 16K tokens it achieves a 12.7% absolute accuracy improvement while preserving full-sequence attention coverage.

📝 Abstract
In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model's attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field.
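The abstract's reading of RoPE as a fixed-scale transform can be made concrete: every dimension pair rotates at a frequency derived from a single base constant fixed at training time. The following is a minimal NumPy sketch of the common RoPE formulation (the `base` parameter and even/odd pairing convention are the standard ones, not code from this paper):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Dimension pair i rotates at frequency base**(-2i/dim).
    # The scale parameter `base` is fixed, which is the single-scale
    # behavior the paper contrasts with multi-scale wavelet transforms.
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, freqs)

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, dim) with dim even; rotate each (even, odd) pair
    # by its position-dependent angle.
    ang = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because all frequencies derive from one fixed `base`, there is no mechanism to adapt the effective window size, which is the multi-scale capability the wavelet view is meant to supply.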
Problem

Research questions and friction points this paper is trying to address.

Extrapolating sequences beyond training lengths
Limitations of RoPE in capturing multi-scale signals
ALiBi's restricted receptive field, limiting deep dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wavelet transforms capture multi-scale features
Enhances long context extrapolation performance
Maintains broad attention field without limitations
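To illustrate the multi-scale idea the bullets above describe, here is a hypothetical sketch of a positional feature built from real Morlet wavelets at geometrically spaced scales. The function names, the scale grid, and the use of a plain (non-learnable) Morlet basis are illustrative assumptions, not the paper's actual MWPE definition:

```python
import numpy as np

def morlet(t, scale, w0=5.0):
    # Real Morlet wavelet at t/scale: a Gaussian-windowed cosine,
    # so each scale acts like a window of a different size.
    u = t / scale
    return np.cos(w0 * u) * np.exp(-0.5 * u ** 2)

def wavelet_positional_encoding(seq_len, dim, scales=None):
    # Hypothetical sketch: each feature dimension gets one scale, so
    # small scales respond to fine local position differences while
    # large scales vary slowly and encode long-range structure.
    if scales is None:
        scales = np.geomspace(1.0, seq_len, num=dim)
    pos = np.arange(seq_len)[:, None]
    return morlet(pos, np.asarray(scales)[None, :])
```

Unlike a fixed-scale rotation, the scale grid here can be extended to cover positions beyond the training length, which gestures at how a wavelet basis supports extrapolation without restricting the attention field.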
Yui Oka
NTT
NLP, ML
Taku Hasegawa
NTT Human Informatics Laboratories, NTT Corporation
Kyosuke Nishida
NTT Human Informatics Laboratories, NTT Corporation
natural language processing, vision and language, artificial intelligence, data mining
Kuniko Saito
NTT Human Informatics Laboratories, NTT Corporation