🤖 AI Summary
This work addresses a performance limitation of arbitrary-order autoregressive models: during generation, structural and semantic information compete for attention capacity. The study argues that the core benefit of two-stream attention lies in decoupling structural induction from semantic prediction, rather than merely separating positional and content information. To test this claim, the authors propose Decoupled RoPE, a modified rotary positional encoding that supplies target position information without exposing target content, and conduct systematic ablation studies within an arbitrary-order autoregressive framework. Results show that Decoupled RoPE is competitive on short sequences but degrades as sequence length grows, whereas two-stream attention preserves long-sequence generation quality by alleviating structural–semantic interference, confirming that this trade-off critically affects model performance.
📝 Abstract
Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths, where semantic and structural proximity coincide, but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.
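To make the central idea concrete, here is a minimal NumPy sketch of the position-without-content principle behind Decoupled RoPE. This is not the paper's implementation: the `placeholder` vector stands in for a hypothetical content-free query, and the rotation function is standard RoPE applied pairwise. The point it illustrates is that attention scores can depend on *where* the model predicts next while remaining independent of the token that will fill that slot.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard RoPE: rotate each pair (x[2i], x[2i+1]) by angle
    pos * base**(-2i/d), where d is the embedding dimension."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 8

# Context keys: token content embeddings rotated by their own positions,
# exactly as in ordinary RoPE attention.
keys = np.stack([rope_rotate(rng.normal(size=d), p) for p in range(4)])

# Decoupled query: a content-free placeholder rotated to the TARGET position.
# The query encodes where the next prediction lands, never what lands there.
placeholder = rng.normal(size=d)      # hypothetical learned query vector
q_at_2 = rope_rotate(placeholder, 2)
q_at_3 = rope_rotate(placeholder, 3)

scores_2 = keys @ q_at_2  # scores change with the target position...
scores_3 = keys @ q_at_3  # ...but never with the (unknown) target token.
```

Because the query is built only from a fixed placeholder and a rotation, swapping which token is eventually predicted at a slot cannot change these attention scores; that is the sense in which position information is provided without revealing content.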