Efficient Matrix Implementation for Rotary Position Embedding

📅 2026-04-09
🤖 AI Summary
This work addresses the computational redundancy and suboptimal hardware utilization of existing Rotary Position Embedding (RoPE) implementations, which rely on vector-level split and merge operations that become increasingly inefficient in multi-dimensional settings. The paper proposes RoME, described as the first approach to unify and reformulate RoPE as a matrix transformation, eliminating dimension-specific operations while preserving mathematical equivalence. This reformulation simplifies implementation and enables fused parallel execution across both Cube and Vector units on modern neural processing units (NPUs). Experimental results show that RoME accelerates both individual operators and full models, improving the inference efficiency and hardware compatibility of Transformers across language, vision, and 3D tasks.
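For context, the rotation that standard RoPE applies can be written out explicitly. The block below is the usual pairwise form from the RoPE literature, not the paper's own notation; the symbols $m$ (token position), $d$ (feature dimension), $\theta_i$ (per-pair frequency), and $P$ (a fixed signed permutation) are assumptions introduced here for illustration.

```latex
% Standard RoPE on the feature pair (x_{2i}, x_{2i+1}) at position m
\begin{pmatrix} y_{2i} \\ y_{2i+1} \end{pmatrix}
=
\begin{pmatrix}
\cos m\theta_i & -\sin m\theta_i \\
\sin m\theta_i & \cos m\theta_i
\end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad \theta_i = 10000^{-2i/d}

% Same result over the whole vector, with \odot the elementwise product
% and P the map (x_{2i}, x_{2i+1}) \mapsto (-x_{2i+1}, x_{2i})
y = x \odot \cos(m\theta) + (P x) \odot \sin(m\theta)
```

In the second line, the only step that moves features around is the product with $P$. That step is what implementations typically realize with vector split and merge operations, and it is a plausible candidate for the matrix form the summary refers to.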

📝 Abstract
Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.
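The equivalence claimed in the abstract can be checked numerically on a small case. The following is a minimal NumPy sketch, not the RoME implementation from the linked repository: it assumes the half-split ("rotate half") RoPE convention, and the function names `rope_angles`, `rope_split_merge`, `rotate_half_matrix`, and `rope_matrix` are hypothetical names chosen for this example. It compares the split-and-merge form with a form in which the feature shuffle is a single multiplication by a constant matrix.

```python
import numpy as np

def rope_angles(seq_len, dim, base=10000.0):
    """Return cos/sin tables of shape (seq_len, dim) for half-split RoPE."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)    # (dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)      # (seq_len, dim/2)
    angles = np.concatenate([angles, angles], axis=-1)   # (seq_len, dim)
    return np.cos(angles), np.sin(angles)

def rope_split_merge(x, cos, sin):
    """Vector-style RoPE: split features in half, swap and negate, merge."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated = np.concatenate([-x2, x1], axis=-1)
    return x * cos + rotated * sin

def rotate_half_matrix(dim):
    """Constant signed permutation P such that x @ P == concat(-x2, x1)."""
    half = dim // 2
    eye = np.eye(half)
    zero = np.zeros((half, half))
    return np.block([[zero, eye], [-eye, zero]])

def rope_matrix(x, cos, sin, P):
    """Matrix-style RoPE: the split/merge step becomes one matmul."""
    return x * cos + (x @ P) * sin

# Compare both forms on random data (assumed sizes: 8 positions, 16 features).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
cos, sin = rope_angles(8, 16)
P = rotate_half_matrix(16)

out_ref = rope_split_merge(x, cos, sin)
out_mat = rope_matrix(x, cos, sin, P)
print(np.allclose(out_ref, out_mat))
```

In this sketch the matrix `P` does not depend on the position or on the input, so it can be built once and reused. The product `x @ P` is the kind of operation a matrix unit handles, while the elementwise multiplications by `cos` and `sin` remain vector work; how RoME actually schedules these across Cube and Vector units is described in the paper and repository, not here.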
Problem

Research questions and friction points this paper is trying to address.

Rotary Position Embedding
computational overhead
multi-dimensional RoPE
hardware utilization
vector operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rotary Position Embedding
Matrix Transformation
Efficient Implementation
Multi-dimensional RoPE
NPU Acceleration
Authors

Chen Minqi
Huawei Technologies
Zhongqi Yue
Nanyang Technological University
Shihao Zhang
University of California, San Diego
Applied Mathematics
Yun Xu
School of Computer Science, University of Science and Technology of China
Parallel Computing, Bioinformatic Algorithms
Peng Wu
Huawei Technologies
Kaixiang Xu
Huawei Technologies
Zeyi Huang
Huawei Technologies
Hanwang Zhang
Nanyang Technological University