Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

📅 2026-05-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work investigates the representational power and efficiency advantages of hybrid recurrent-attention decoders on tasks that require an intermediate scratchpad. Using a constructed parity-conditioned retrieval task, the authors show, under a constant-precision assumption, that no comparable constant-scratchpad solution exists for a pure Gated DeltaNet, while pure Gated Attention requires a scratchpad of at least polynomial length. In contrast, a hybrid architecture combining Gated DeltaNet with Gated Attention solves the task with a constant-length scratchpad, i.e., $O(1)$ chain-of-thought steps. The result gives theoretical evidence that Qwen-style hybrid decoders can need substantially shorter scratchpads than either component alone.
📝 Abstract
We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.
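The abstract does not spell out the formal task, so the following is only an illustrative toy under stated assumptions, not the paper's definition or construction. It assumes a hypothetical input format (a bit string, a list of parity-tagged key-value pairs, and a query key) and shows the intuition behind a hybrid: a recurrent-style pass can track parity in one bit of state, and an attention-style lookup can then retrieve the matching value. The names `running_parity`, `retrieve`, and `solve` are invented for this sketch.

```python
from typing import List, Tuple

# Hypothetical toy task (an assumption, not the paper's definition):
# given bits, parity-tagged (key, tag, value) triples, and a query key,
# return the value whose key matches the query and whose tag equals
# the parity of the bit string.

def running_parity(bits: List[int]) -> int:
    """Recurrent-style pass: one bit of state, updated once per token."""
    state = 0
    for b in bits:
        state ^= b  # flip the state whenever a 1 is read
    return state

def retrieve(pairs: List[Tuple[str, int, str]], query: str, parity: int) -> str:
    """Attention-style lookup: hard match on (key, parity tag) over the context."""
    for key, tag, value in pairs:
        if key == query and tag == parity:
            return value
    raise KeyError((query, parity))

def solve(bits: List[int], pairs: List[Tuple[str, int, str]], query: str) -> str:
    # Parity comes from the recurrent pass; retrieval is conditioned on it.
    return retrieve(pairs, query, running_parity(bits))

bits = [1, 0, 1, 1]  # three ones, so parity is odd (1)
pairs = [("a", 0, "red"), ("a", 1, "blue"), ("b", 0, "green"), ("b", 1, "gold")]
answer = solve(bits, pairs, "a")
print(answer)  # blue
```

The sketch uses exact Python control flow, so it says nothing about the constant-precision lower bounds the paper proves; it only illustrates why combining a state-tracking component with a retrieval component is a natural fit for this kind of task.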
Problem

Research questions and friction points this paper is trying to address.

hybrid decoder
expressive power
scratchpad length
Gated DeltaNet
Gated Attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid DeltaNet-Attention
constant scratchpad
parity-conditioned retrieval
chain-of-thought complexity
recurrent-attention decoders