Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low inference efficiency, weak long-context modeling, and reliance on explicit positional encoding in long-sequence generation, this paper proposes SambaY, a decoder-hybrid-decoder architecture. Its core innovation is the Gated Memory Unit (GMU), which shares memory readout states from the self-decoder across cross-decoder layers; the resulting architecture requires no explicit positional encoding and retains linear prefill complexity. Combined with a Samba-based self-decoder, a YOCO-style cross-decoder, and Differential Attention, SambaY strengthens long-range dependency modeling and exhibits lower irreducible loss in scaling experiments. The largest model, Phi4-mini-Flash-Reasoning, outperforms Phi4-mini-Reasoning on reasoning-intensive benchmarks including Math500, AIME24/25, and GPQA Diamond, and delivers up to a 10× improvement in decoding throughput when generating 32K tokens from 2K-token prompts, validating its joint optimization of high throughput and strong generalization.
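
The GMU described above gates a memory readout shared from an earlier layer with the current layer's hidden state. Below is a minimal single-layer sketch of that idea in NumPy, assuming a SwiGLU-style element-wise gate; the weight names (`W_g`, `W_o`) and the exact projection layout are illustrative, not the paper's verified parameterization.

```python
import numpy as np


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


class GatedMemoryUnit:
    """Sketch of a GMU: reuses a memory readout `m` produced by an SSM
    layer in the self-decoder, gating it element-wise with the current
    cross-decoder hidden state `x`. Because `m` is shared rather than
    recomputed, the cross-decoder layer avoids its own attention pass."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        scale = d_model ** -0.5
        self.W_g = rng.normal(0.0, scale, (d_model, d_model))  # gate projection
        self.W_o = rng.normal(0.0, scale, (d_model, d_model))  # output projection

    def __call__(self, x, m):
        # x, m: (seq_len, d_model); gate the shared memory with silu(x W_g)
        return (silu(x @ self.W_g) * m) @ self.W_o
```

In the paper's architecture, GMUs replace roughly half of the cross-attention layers in the YOCO-style cross-decoder, so each decoding step touches the shared memory instead of running full attention, which is where the throughput gain comes from.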

📝 Abstract
Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
Problem

Research questions and friction points this paper is trying to address.

Efficient memory sharing across SSM layers
Enhancing decoding efficiency in long-context tasks
Improving performance scalability without positional encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Memory Unit enables efficient memory sharing
Decoder-Hybrid-Decoder architecture enhances decoding efficiency
Differential Attention boosts reasoning task performance
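
The Differential Attention mentioned in the last bullet (from the Differential Transformer line of work) computes two softmax attention maps and subtracts one from the other, scaled by a learned factor λ, to cancel common-mode attention noise. A minimal single-head NumPy sketch, with λ passed as a plain scalar rather than the paper's learned reparameterization:

```python
import numpy as np


def softmax(a):
    """Row-wise numerically stable softmax."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)


def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Differential attention: (softmax(Q1 K1^T / sqrt(d))
    - lam * softmax(Q2 K2^T / sqrt(d))) @ V.
    Q1/Q2 and K1/K2 are two halves of the projected queries/keys;
    subtracting the two maps suppresses attention noise shared by both."""
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V
```

With `lam = 0` this reduces to standard softmax attention, which makes the mechanism easy to drop into an existing attention layer for comparison.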