DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the asymmetric computational and memory demands of the prefill and decode phases in hybrid Mamba-Transformer models, as well as the inefficiency of mapping state space models (SSMs) onto general-purpose accelerators. To this end, the authors propose DUET, a decoupled accelerator architecture that pairs a systolic-array tile for large-scale matrix operations and long-sequence SSMs during prefill with a vector-unit array backed by high-bandwidth on-chip memory for per-token SSM and vector-matrix computations during decode. DUET supports runtime reconfiguration to accommodate hybrid model structures and is the first architecture to decouple the prefill and decode phases at the hardware level. Evaluated on Nemotron-H-56B, Zamba2-7B, and Llama3-8B, DUET reduces time-to-first-token by 4×, improves throughput by 1.4×, and decreases inter-token latency by 1.5× compared to the B200 GPU.

📝 Abstract
Large language models operate in two distinct phases: a compute-bound prefill followed by a memory-bandwidth-bound decode. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns the prefill and decode phases to specialized packages. The Prefill package uses systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package uses vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4× faster time to first token, 1.4× higher throughput, and 1.5× lower time between tokens over the B200 GPU.
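The compute-bound vs. bandwidth-bound split the abstract describes can be made concrete with a back-of-envelope arithmetic-intensity (FLOPs per byte) comparison of a batched prefill matmul against a single-token decode vector-matrix product. This is an illustrative sketch only; the hidden dimension, sequence length, and fp16 precision below are hypothetical values, not taken from the paper.

```python
# Roofline-style arithmetic-intensity sketch for the two inference phases.
# All sizes are hypothetical, chosen only to illustrate the asymmetry.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

d = 4096           # hypothetical hidden dimension
seq = 2048         # prompt length processed at once during prefill
bytes_per = 2      # fp16 weights and activations

# Prefill: one (seq x d) @ (d x d) matmul per projection.
prefill_flops = 2 * seq * d * d
prefill_bytes = bytes_per * (seq * d + d * d + seq * d)  # in + weights + out

# Decode: one (1 x d) @ (d x d) vector-matrix product per token.
decode_flops = 2 * d * d
decode_bytes = bytes_per * (d + d * d + d)               # in + weights + out

ai_prefill = arithmetic_intensity(prefill_flops, prefill_bytes)
ai_decode = arithmetic_intensity(decode_flops, decode_bytes)

print(f"prefill: {ai_prefill:7.1f} FLOPs/byte")  # high -> compute-bound
print(f"decode:  {ai_decode:7.1f} FLOPs/byte")   # ~1  -> bandwidth-bound
```

With these numbers prefill lands near 10³ FLOPs/byte while decode sits near 1, which is why a systolic array suits the former and a vector-unit array with high-bandwidth memory suits the latter.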
Problem

Research questions and friction points this paper is trying to address.

Hybrid Mamba-Transformer
prefill-decode asymmetry
state space models
accelerator mismatch
LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated Accelerator
Hybrid Mamba-Transformer
Prefill/Decode Specialization
State Space Model (SSM)
Runtime-Configurable Architecture
Alish Kanani
University of Wisconsin–Madison
Chiplets · Thermal management · Performance Modeling · Task Scheduling · Approximate Circuits
Sangwan Lee
University of Ulsan
Han Lyu
University of Wisconsin–Madison
Jiahao Lin
University of Wisconsin–Madison
Jaehyun Park
Assistant Professor, School of Electrical Engineering, University of Ulsan
Low-power design · IoT system design
Umit Y. Ogras
University of Wisconsin–Madison