🤖 AI Summary
This work investigates the expressive power and efficiency of hybrid recurrent-attention decoders on tasks that require intermediate memory (a scratchpad). Using a constructed parity-conditioned retrieval task, the authors show that, under a constant-precision assumption, pure Gated DeltaNet models admit no comparable solution, while pure Gated Attention requires a scratchpad of at least polynomial length. In contrast, a Qwen-style hybrid that combines Gated DeltaNet with Gated Attention solves the task with a constant-length scratchpad, i.e., O(1) chain-of-thought steps. The result is a formal separation: on this task, the hybrid decoder needs substantially less intermediate memory than either of its components used alone.
📝 Abstract
We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.