Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the expressive power and efficiency of hybrid recurrent-attention decoders on tasks that require an intermediate scratchpad. Using a constructed parity-conditioned retrieval task, the authors show that, under a constant-precision assumption, a pure Gated DeltaNet model admits no comparable solution, while a pure Gated Attention model requires a scratchpad of at least polynomial length. In contrast, a Qwen-style hybrid combining Gated DeltaNet with Gated Attention solves the task with a constant-length scratchpad, i.e., $O(1)$ chain-of-thought steps. The result gives formal evidence that such hybrid decoders can need far shorter scratchpads than either component alone.

📝 Abstract

We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.
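For readers unfamiliar with the recurrent component, the following is a minimal sketch of the gated delta rule commonly used to describe a Gated DeltaNet head: a fixed-size state matrix is decayed by a gate, the value currently stored under the incoming key is erased, and the new value is written. It is an illustration only and not the paper's construction or task; the function names `gated_delta_step` and `read_state`, the list-of-rows layout, and the scalar gates `alpha` and `beta` are assumptions made for this example.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update (illustrative sketch, hypothetical names).

    S     : state matrix as a list of d_v rows, each of length d_k
    k, v  : key (length d_k) and value (length d_v) for the current token
    alpha : decay gate in [0, 1]; beta : write strength in [0, 1]

    Computes S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T.
    """
    # Value the state currently returns for key k, i.e. S k.
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in S]
    return [
        [
            alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
            for j in range(len(k))
        ]
        for i in range(len(v))
    ]


def read_state(S, q):
    """Read from the state with query q, i.e. return S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]
```

With a unit-norm key and `alpha = beta = 1.0`, writing a second value under the same key replaces the first rather than adding to it. The state stays the same size however long the input is, whereas an attention head keeps every past key and value available for lookup; the abstract's comparison is between these two kinds of memory.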

Problem

Research questions and friction points this paper is trying to address.

hybrid decoder

expressive power

scratchpad length

Gated DeltaNet

Gated Attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid DeltaNet-Attention

constant scratchpad

parity-conditioned retrieval