Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Speculative decoding faces a trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens explicitly but incur sequential overhead, while parallel drafters are cheap but capture intra-block dependencies poorly. To reconcile the two, the authors propose Domino, a framework that decouples causal modeling from autoregressive draft execution: a parallel draft backbone first produces preliminary token distributions for the entire block, and a lightweight Domino head then refines them with prefix-dependent causal information. To stabilize teacher-forced training, the method uses a base-anchored curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. On Qwen3 models, Domino achieves up to 5.49× end-to-end speedup with the Transformers backend and up to 5.8× throughput speedup under SGLang serving.
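The two-stage drafting idea can be illustrated with a toy sketch. This is not the authors' implementation: the function `draft_block`, the table `head_bias`, and the choice to condition only on the single previous draft token are all assumptions made for illustration. The point is only that the expensive per-position logits are produced once, in parallel, and the sequential part is a cheap additive correction.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def draft_block(base_logits, head_bias):
    """Greedy drafting for one block (hypothetical toy version).

    base_logits: one list of logits per block position, assumed to come
        from a single parallel backbone pass.
    head_bias[prev][tok]: assumed additive correction to the logit of
        `tok` when the previously drafted token is `prev`. A real head
        would condition on the whole drafted prefix, not one token.
    """
    tokens = []
    prev = None
    for logits in base_logits:
        if prev is not None:
            # Cheap sequential step: adjust the precomputed logits.
            logits = [x + b for x, b in zip(logits, head_bias[prev])]
        probs = softmax(logits)
        prev = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(prev)
    return tokens

base_logits = [[2.0, 1.0, 0.0], [0.5, 0.4, 0.0], [0.0, 0.1, 0.3]]
no_correction = [[0.0, 0.0, 0.0] for _ in range(3)]
# Assumed correction: after token 0, discourage drafting token 0 again.
correction = [[-1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

parallel_only = draft_block(base_logits, no_correction)
refined = draft_block(base_logits, correction)
```

With a zero correction the draft is just the per-position argmax of the backbone output; with the assumed correction table, the second position changes because it now depends on the first drafted token.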

📝 Abstract

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
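One way to picture the base-anchored curriculum is as a loss weight that moves from the backbone's preliminary distribution toward the corrected final distribution over training. The abstract does not give the schedule, so the linear ramp, the names `final_weight` and `curriculum_loss`, and the default step counts below are assumptions for illustration only.

```python
def final_weight(step, warmup_steps, ramp_steps):
    # Weight on the final (causally corrected) loss.
    # Assumed schedule: 0 during warmup, then a linear ramp up to 1.
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)

def curriculum_loss(base_loss, final_loss, step,
                    warmup_steps=1000, ramp_steps=4000):
    # Early steps train only the parallel backbone; later steps shift
    # the objective toward the corrected distribution.
    w = final_weight(step, warmup_steps, ramp_steps)
    return (1.0 - w) * base_loss + w * final_loss
```

Under these assumed defaults, step 0 optimizes only the backbone loss, step 3000 weights the two losses equally, and any step from 5000 onward optimizes only the final loss.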

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

autoregressive drafting

causal dependency

draft quality

inference acceleration

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

causal modeling

parallel drafting