🤖 AI Summary
Standard multidimensional RoPE underperforms in fine-grained image generation, where it struggles with spatial relationships, color cues, and object counting, because of its fixed frequency allocation, axis-wise independence, and uniform treatment of attention heads. To address this, the authors propose Head-Adaptive RoPE (HARoPE), a lightweight learnable linear transformation, parameterized via SVD, that is applied before the rotary mapping. It adapts per-head frequency spectra, the semantic alignment of rotation planes, and positional receptive fields while preserving RoPE's relative-position property. The method is a drop-in replacement evaluated with diffusion architectures such as MMDiT and Flux. Experiments show consistent improvements over strong RoPE baselines and alternative extensions on both class-conditional ImageNet generation and text-to-image synthesis. HARoPE thus offers an efficient, general-purpose way to improve positional and spatial awareness in transformer-based image generative models.
📝 Abstract
Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant weaknesses in fine-grained spatial relation modeling, color cues, and object counting. This paper identifies three key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation, parameterized via singular value decomposition (SVD), before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
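The relative-position claim can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a single head, 1D positions instead of the multi-dimensional case, and one matrix `A` shared by queries and keys. The helper names (`rope_rotate`, `svd_param_matrix`, `score`) and the random initialization are hypothetical. Under these assumptions, the attention score between a query at position m and a key at position n depends only on n - m, because the rotations compose to a single rotation by the offset regardless of what `A` is.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply standard 1D RoPE to a vector x (even length d) at position pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D plane
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]                   # coordinates of each plane
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def svd_param_matrix(rng, d):
    """Hypothetical per-head matrix A = U diag(s) V^T (assumed form).

    U and V are orthogonal (from QR of random matrices) and s is positive;
    in a trained model these factors would be learned parameters.
    """
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))
    s = np.exp(0.1 * rng.standard_normal(d))
    return U @ np.diag(s) @ V.T

def score(q, k, A, m, n):
    """Attention logit with A applied before the rotary mapping."""
    return rope_rotate(A @ q, m) @ rope_rotate(A @ k, n)

rng = np.random.default_rng(0)
d = 8
A = svd_param_matrix(rng, d)
q, k = rng.standard_normal(d), rng.standard_normal(d)

s1 = score(q, k, A, 3, 7)     # offset n - m = 4
s2 = score(q, k, A, 13, 17)   # same offset, both positions shifted by 10
shift_invariant = bool(np.isclose(s1, s2))
print(shift_invariant)
```

Because `A` acts on the features before rotation, it can mix coordinates across rotary planes (and hence across frequencies) without touching the position-dependent rotation itself, which is why the shift check passes for any choice of `A` in this sketch.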