SeqPE: Transformer with Sequential Position Encoding

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional positional encodings (PEs) in Transformers rely on fixed-size lookup tables, limiting generalization beyond pre-specified sequence lengths; methods such as ALiBi and RoPE improve length extrapolation, but adapting them to new modalities requires substantial architectural modification. This paper proposes SeqPE (Sequential Position Encoding): a framework that represents each position index as a symbolic sequence and employs a lightweight sequence encoder to generate position embeddings end-to-end. To improve generalization, SeqPE introduces two regularization objectives: a contrastive loss that aligns embedding distances with a predefined position-distance function, and a distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, enabling zero-modification generalization across multi-dimensional inputs (1D and 2D). Evaluated on language modeling, long-context question answering, and 2D image classification, SeqPE consistently outperforms ALiBi, RoPE, and other baselines. In extrapolation settings it achieves significant improvements in exact match (EM) and accuracy, demonstrating both strong length extrapolation capability and cross-modal scalability.

📝 Abstract
Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE mitigate this limitation but demand extensive modifications to adapt to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy--particularly under context length extrapolation--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.
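The core idea, representing an $n$-dimensional position index as a symbolic sequence fed to a small encoder, can be sketched as follows. This is a toy illustration, not the authors' implementation: the names (`digit_embed`, `encode_position`) are assumptions, and mean pooling over random symbol embeddings stands in for SeqPE's learned sequential encoder.

```python
# Toy sketch of SeqPE's input representation (assumed names, not the paper's code):
# a position index becomes a symbol sequence, and a tiny "encoder" (here: a
# random embedding table plus mean pooling) maps it to a fixed-size vector.
import random

VOCAB = "0123456789,"  # digit symbols plus a separator for n-dimensional indices
DIM = 8                # toy embedding dimension

random.seed(0)
digit_embed = {ch: [random.gauss(0, 1) for _ in range(DIM)] for ch in VOCAB}

def position_to_symbols(index):
    """Render an n-dimensional position index as symbols, e.g. (3, 12) -> "3,12"."""
    if isinstance(index, int):
        index = (index,)
    return ",".join(str(i) for i in index)

def encode_position(index):
    """Mean-pool the symbol embeddings; in SeqPE this pooling is replaced by a
    learned lightweight sequence encoder trained end-to-end."""
    symbols = position_to_symbols(index)
    vecs = [digit_embed[ch] for ch in symbols]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# The same encoder handles 1D and 2D indices with no architectural change:
e1 = encode_position(1024)     # 1D position, beyond any fixed table size
e2 = encode_position((3, 12))  # 2D position index, e.g. for image patches
```

Because any index renders to a symbol string, positions never seen in training still produce embeddings, which is what makes extrapolation and cross-modal reuse possible in the first place.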
Problem

Research questions and friction points this paper is trying to address.

Enabling spatial understanding in permutation-invariant Transformers
Overcoming fixed-size position embedding extrapolation limits
Adapting position encoding to multi-dimensional inputs seamlessly
Innovation

Methods, ideas, or system contributions that make the work stand out.

SeqPE uses symbolic sequences for position encoding
Lightweight encoder learns embeddings end-to-end
Contrastive and distillation objectives enhance extrapolation
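The contrastive objective above can be illustrated with a minimal sketch: penalize disagreement between embedding-space distance and a predefined position-distance function. The squared-mismatch loss form and the function names here are assumptions for illustration; the paper's exact contrastive formulation may differ.

```python
# Hedged sketch (not the paper's exact loss): align pairwise embedding
# distances with a predefined position-distance function.
def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def position_distance(p, q):
    """Predefined distance between 1D position indices (here simply |p - q|)."""
    return abs(p - q)

def distance_alignment_loss(embed, positions):
    """Average squared mismatch between embedding distance and position
    distance over all pairs; minimizing it shapes the embedding space so
    nearby positions get nearby embeddings."""
    loss, n = 0.0, 0
    for i, p in enumerate(positions):
        for q in positions[i + 1:]:
            loss += (l2(embed(p), embed(q)) - position_distance(p, q)) ** 2
            n += 1
    return loss / max(n, 1)
```

An embedding that already respects the position metric, e.g. the identity map `embed(p) = [p]`, drives this loss to zero, which is the geometric property the regularizer encourages.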
Huyang Li
Nara Institute of Science and Technology (NAIST), Nara, Japan

Yahui Liu
Kuaishou Technology, Beijing, China

Hongyu Sun
Nara Institute of Science and Technology (NAIST), Nara, Japan

Deng Cai
Professor of Computer Science, Zhejiang University
Machine learning, Computer vision, Data mining, Information retrieval

Leyang Cui
Tencent AI Lab
Natural Language Processing

Wei Bi
HKUST
NLG, Dialog System, NLP, Machine Learning, Data Mining

Peilin Zhao
Tencent, Shenzhen, China

Taro Watanabe
Nara Institute of Science and Technology
Machine Translation, Machine Learning