🤖 AI Summary
This work addresses the limitations of existing visual positional encoding methods, which often neglect the geometric and perceptual characteristics inherent to visual modalities, thereby constraining performance in multimodal tasks. To overcome this, the authors propose Parabolic Position Encoding (PaPE), a novel approach that systematically integrates key geometric priors—including translation invariance, rotation invariance, distance decay, directionality, and context awareness—into its design, yielding a theoretically grounded parabolic formulation. A rotation-invariant variant, PaPE-RI, is also derived. Evaluated across eight datasets spanning four modalities (images, point clouds, videos, and event streams), PaPE or PaPE-RI achieves the top result on seven of the eight benchmarks. Notably, in out-of-distribution extrapolation experiments on ImageNet-1K, the method improves absolute accuracy by up to 10.5% over the next-best position encoding.
📝 Abstract
We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as images, point clouds, videos, or event camera streams, our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but with only partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
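To make the design principles concrete, here is a minimal, purely illustrative sketch of two of them, translation invariance and distance decay, realized as an additive attention bias that is a downward parabola in token distance. This is an assumption-laden toy over 1D positions, not the paper's actual PaPE formulation (see the linked repository for that); the function names and the choice of `alpha` are hypothetical.

```python
import numpy as np

def parabolic_bias(num_tokens, alpha=0.5):
    # Hypothetical additive attention bias: logits decay quadratically
    # with token distance |i - j|. Depending only on i - j makes it
    # translation invariant; the parabola gives distance decay.
    # Illustrative only -- NOT the paper's PaPE formulation.
    pos = np.arange(num_tokens)
    dist = np.abs(pos[:, None] - pos[None, :])  # pairwise distances
    return -alpha * dist.astype(float) ** 2

def attention_with_bias(q, k, v, bias):
    # Standard scaled dot-product attention with an additive bias term.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 6, 4
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

bias = parabolic_bias(n, alpha=0.5)
out = attention_with_bias(q, k, v, bias)
```

Because the bias depends only on the offset `i - j`, shifting all positions by the same amount leaves it unchanged, which is the translation-invariance principle in its simplest form.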