PaTH Attention: Position Encoding via Accumulating Householder Transformations

πŸ“… 2025-05-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Rotary position embeddings (RoPE) apply content-agnostic positional transformations, which limits model capacity for structured sequence modeling. This paper proposes PaTH, a data-dependent position encoding method that applies cumulative products of input-dependent Householder transformations, letting positional representations adapt to sequence content. PaTH exploits a compact representation of products of Householder matrices to keep training differentiable and parallelizable, and integrates FlashAttention-style blockwise I/O optimization to balance computational efficiency with representational expressiveness. Experiments show that PaTH outperforms RoPE and other recent position encoding methods on synthetic structural-modeling tasks and medium-scale language modeling, supporting both the effectiveness and the generality of content-aware positional modeling.

πŸ“ Abstract
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and is otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(-like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
Problem

Research questions and friction points this paper is trying to address.

Limitation of RoPE's input-independent relative position encoding
Need for flexible data-dependent position encoding in transformers
Improving expressivity and performance in large language models
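The first friction point above can be made concrete with a small sketch. In standard RoPE, each position's query/key is rotated by angles that depend only on the position index, so the attention score between positions m and n depends only on the offset m βˆ’ n, never on the tokens themselves. A minimal numpy illustration (the function name and toy dimensions are my own, not from the paper):

```python
import numpy as np

def rope_rotate(x, pos, theta=10000.0):
    # Standard RoPE: rotate consecutive dimension pairs of x by angles
    # pos * freq. The rotation depends only on the position index `pos`,
    # never on the content of x -- this is the input-independence PaTH targets.
    d = len(x)
    freqs = theta ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    out = np.empty(d)
    out[0::2] = x[0::2] * np.cos(ang) - x[1::2] * np.sin(ang)
    out[1::2] = x[0::2] * np.sin(ang) + x[1::2] * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(4), rng.standard_normal(4)

# The score depends only on the relative offset: positions (7, 3) and
# (12, 8) both have offset 4, so the scores coincide.
s1 = rope_rotate(q, 7) @ rope_rotate(k, 3)
s2 = rope_rotate(q, 12) @ rope_rotate(k, 8)
assert np.allclose(s1, s2)
```

This relative-offset invariance is exactly what makes RoPE content-agnostic: no matter what the intervening tokens are, the positional transform between two positions is fixed.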
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-dependent position encoding via Householder transformations
Efficient parallel training with compact matrix representation
FlashAttention-style blockwise algorithm for minimal I/O cost
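To illustrate the first two contributions, here is a toy numpy sketch of accumulating input-dependent Householder transformations between positions. It is a conceptual illustration only: the per-position vectors, dimensions, and the materialized cumulative products are my own simplifications (the paper's actual training algorithm avoids materializing these products via a compact representation and a blockwise FlashAttention-style kernel).

```python
import numpy as np

def householder(v):
    # Householder reflection H = I - 2 v v^T / (v^T v): orthogonal and symmetric.
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

rng = np.random.default_rng(0)
d, T = 4, 5

# Toy data-dependent vectors, one per position. (Hypothetical stand-in:
# in PaTH these would come from a learned projection of the token input.)
V = rng.standard_normal((T, d))

# Cumulative products P_t = H_t ... H_1, with P_0 = I. Each step folds in
# one input-dependent Householder transform.
P = [np.eye(d)]
for t in range(T):
    P.append(householder(V[t]) @ P[-1])

# The relative transform between key position i and query position j (i <= j)
# is P_j P_i^{-1}. Householder products are orthogonal, so the inverse is
# just the transpose -- but unlike RoPE, this transform depends on the
# inputs at every intervening position.
q, k = rng.standard_normal(d), rng.standard_normal(d)
i, j = 1, 4
rel = P[j] @ P[i].T
logit = q @ rel @ k
```

Because every intervening token contributes its own reflection, changing any token between positions i and j changes the relative transform, which is the content-awareness RoPE lacks.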