DUET: Disaggregated Hybrid Mamba-Transformer LLMs with Prefill and Decode-Specific Packages

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the asymmetric computational and memory demands of the prefill and decode phases in hybrid Mamba-Transformer models, as well as the inefficiency of mapping state space models (SSMs) onto general-purpose accelerators. To this end, the authors propose DUET, a decoupled accelerator architecture that pairs a systolic-array tile for large-scale matrix operations and long-sequence SSMs during prefill with a vector-unit array backed by high-bandwidth on-chip memory for per-token SSM and vector-matrix computations during decode. DUET supports runtime reconfiguration to accommodate hybrid model structures and is the first architecture to decouple the prefill and decode phases at the hardware level. Evaluated on Nemotron-H-56B, Zamba2-7B, and Llama3-8B, DUET reduces time-to-first-token by 4×, improves throughput by 1.4×, and decreases inter-token latency by 1.5× compared to the B200 GPU.

📝 Abstract
Large language models operate in two distinct phases: a compute-bound prefill followed by a memory-bandwidth-bound decode. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly to matmul-centric accelerators. This mismatch causes performance bottlenecks, showing that a homogeneous architecture cannot satisfy all requirements. We introduce DUET, a disaggregated accelerator that assigns the prefill and decode phases to specialized packages. The Prefill package uses systolic array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSMs. The Decode package uses vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models with mixed Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4× faster time to first token, 1.4× higher throughput, and 1.5× lower time between tokens over the B200 GPU.
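The compute-bound vs. bandwidth-bound split the abstract describes can be made concrete with a back-of-envelope arithmetic-intensity (FLOPs per byte) comparison of a batched prefill matmul against a single-token decode vector-matrix product. This is an illustrative sketch only; the hidden dimension, sequence length, and fp16 precision below are hypothetical values, not taken from the paper.

```python
# Roofline-style arithmetic-intensity sketch for the two inference phases.
# All sizes are hypothetical, chosen only to illustrate the asymmetry.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

d = 4096           # hypothetical hidden dimension
seq = 2048         # prompt length processed at once during prefill
bytes_per = 2      # fp16 weights and activations

# Prefill: one (seq x d) @ (d x d) matmul per projection.
prefill_flops = 2 * seq * d * d
prefill_bytes = bytes_per * (seq * d + d * d + seq * d)  # in + weights + out

# Decode: one (1 x d) @ (d x d) vector-matrix product per token.
decode_flops = 2 * d * d
decode_bytes = bytes_per * (d + d * d + d)               # in + weights + out

ai_prefill = arithmetic_intensity(prefill_flops, prefill_bytes)
ai_decode = arithmetic_intensity(decode_flops, decode_bytes)

print(f"prefill: {ai_prefill:7.1f} FLOPs/byte")  # high -> compute-bound
print(f"decode:  {ai_decode:7.1f} FLOPs/byte")   # ~1  -> bandwidth-bound
```

With these numbers prefill lands near 10³ FLOPs/byte while decode sits near 1, which is why a systolic array suits the former and a vector-unit array with high-bandwidth memory suits the latter.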
Problem

Research questions and friction points this paper is trying to address.

Hybrid Mamba-Transformer
prefill-decode asymmetry
state space models
accelerator mismatch
LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disaggregated Accelerator
Hybrid Mamba-Transformer
Prefill/Decode Specialization
State Space Model (SSM)
Runtime-Configurable Architecture
Alish Kanani
University of Wisconsin–Madison
Chiplets · Thermal management · Performance Modeling · Task Scheduling · Approximate Circuits
Sangwan Lee
University of Ulsan
Han Lyu
University of Wisconsin–Madison
Jiahao Lin
University of Wisconsin–Madison
Jaehyun Park
Assistant Professor, School of Electrical Engineering, University of Ulsan
Low-power design · IoT system design
Umit Y. Ogras
University of Wisconsin–Madison