🤖 AI Summary
This work addresses the computational redundancy and suboptimal hardware utilization of existing Rotary Position Embedding (RoPE) implementations, which rely on vector-level splitting and merging operations that become inefficient in multidimensional settings. The paper proposes RoME, the first approach to unify and reformulate RoPE as a matrix transformation, eliminating dimension-dependent operations while preserving mathematical equivalence. This reformulation simplifies implementation and enables fused parallel execution across both Cube and Vector units on modern neural processing units (NPUs). Experimental results demonstrate that RoME achieves significant acceleration at both the operator and full-model levels, substantially improving the inference efficiency and hardware compatibility of Transformers across language, vision, and 3D tasks.
📝 Abstract
Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.
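To make the contrast concrete, here is a minimal NumPy sketch, not the paper's implementation, comparing the conventional "rotate-half" vector form of RoPE (split, element-wise rotate, merge) with a mathematically equivalent formulation as a single matrix transformation. Function names and the toy dimensions are illustrative assumptions; RoME's actual kernel fuses this on NPU Cube/Vector units.

```python
import numpy as np

def rope_vector(x, cos, sin):
    # Conventional RoPE ("rotate-half" form): split the feature vector
    # into two halves, rotate via element-wise ops, then merge.
    x1, x2 = np.split(x, 2, axis=-1)
    rotated = np.concatenate((-x2, x1), axis=-1)
    cos_full = np.concatenate((cos, cos), axis=-1)
    sin_full = np.concatenate((sin, sin), axis=-1)
    return x * cos_full + rotated * sin_full

def rope_matrix(x, cos, sin):
    # Equivalent matrix form: build one rotation matrix R and apply a
    # single matmul (x @ R.T). No split/merge on the feature vector;
    # each dimension pair (i, i + d/2) is rotated by its own angle.
    d = x.shape[-1]
    half = d // 2
    R = np.zeros((d, d))
    idx = np.arange(half)
    R[np.arange(d), np.arange(d)] = np.concatenate((cos, cos))
    R[idx, idx + half] = -sin
    R[idx + half, idx] = sin
    return x @ R.T

# Toy example: one 4-dim feature vector, two rotation angles.
x = np.array([0.3, -1.1, 0.7, 2.0])
angles = np.array([0.5, 1.2])
cos, sin = np.cos(angles), np.sin(angles)

# Both forms produce the same rotated vector.
print(np.allclose(rope_vector(x, cos, sin), rope_matrix(x, cos, sin)))
```

Since R is orthogonal (a product of independent 2D plane rotations), the transform also preserves the vector norm, which is one way to sanity-check an implementation.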