TiDAR: Think in Diffusion, Talk in Autoregression

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously achieving high generation quality and inference efficiency — the respective strengths of autoregressive models (ARMs) and diffusion language models (DLMs) — this paper proposes a sequence-level hybrid architecture: parallel multi-token drafting (Thinking) in diffusion followed by autoregressive sampling (Talking), all within a single forward pass. The core innovation is a structured attention mask that lets drafting and verification share one model and one forward pass while preserving full parallelism. The method also supports exact KV cache reuse and parallel sequence generation. Experiments at 1.5B and 8B scales demonstrate throughput improvements of 4.71x to 5.91x over AR models, while also outperforming state-of-the-art speculative decoding and existing DLM approaches. Crucially, generation quality matches that of ARMs — making this the first DLM-based framework to close the quality gap with ARMs without sacrificing parallelism.
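The structured attention mask described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it builds a boolean mask in which the committed prefix attends causally (standard AR), while the parallel draft block attends bidirectionally within itself (diffusion-style) and sees the full prefix.

```python
def tidar_attention_mask(n_prefix: int, n_draft: int):
    """Illustrative sketch (NOT the paper's exact mask): structured attention
    combining a causal prefix with a bidirectional draft block.
    mask[i][j] == True means query position i may attend to key position j."""
    n = n_prefix + n_draft
    mask = [[False] * n for _ in range(n)]
    # Prefix tokens: standard causal (lower-triangular) attention.
    for i in range(n_prefix):
        for j in range(i + 1):
            mask[i][j] = True
    # Draft tokens: attend to the entire prefix and to every draft position
    # (bidirectional within the block, as in diffusion-style drafting).
    for i in range(n_prefix, n):
        for j in range(n):
            mask[i][j] = True
    return mask

mask = tidar_attention_mask(n_prefix=4, n_draft=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# Prints a 4-row causal triangle followed by 3 fully-open draft rows:
# 1000000
# 1100000
# 1110000
# 1111000
# 1111111
# 1111111
# 1111111
```

Because both regions live in one mask, drafting and verification can run in the same forward pass rather than in two separate model calls.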

📝 Abstract
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
Problem

Research questions and friction points this paper is trying to address.

Balancing parallel generation speed with autoregressive model quality
Overcoming limitations of speculative decoding and diffusion variants
Achieving high throughput without sacrificing language modeling performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid architecture combining diffusion drafting with autoregressive sampling
Structured attention masks enable parallel drafting in single forward pass
Achieves AR-level quality with 4.7-5.9x throughput improvement
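The draft-then-verify step behind the last bullet can be sketched as follows. This is a hypothetical greedy-acceptance illustration in the spirit of speculative decoding, not TiDAR's actual sampling code: the AR head rescores every draft position in the same forward pass, the longest agreeing prefix of drafts is kept, and one extra AR token is committed so each pass makes progress.

```python
def accept_drafts(draft_tokens, ar_tokens):
    """Illustrative sketch (hypothetical helper, not the paper's code).
    draft_tokens: tokens proposed in parallel by the diffusion drafter.
    ar_tokens:    the AR head's prediction at each of those positions,
                  computed in the same forward pass.
    Returns the accepted draft prefix plus one bonus token from the AR head,
    guaranteeing at least one committed token per forward pass."""
    accepted = []
    for draft, ar in zip(draft_tokens, ar_tokens):
        if draft != ar:
            break  # first disagreement: stop accepting drafts
        accepted.append(draft)
    # The AR prediction at the first mismatch is always a valid next token.
    bonus = ar_tokens[len(accepted)] if len(accepted) < len(ar_tokens) else None
    return accepted, bonus

acc, bonus = accept_drafts([5, 9, 2, 7], [5, 9, 4, 7])
print(acc, bonus)  # [5, 9] 4
```

When the drafter agrees with the AR head on most positions, several tokens are committed per pass, which is where the reported 4.7-5.9x throughput gain comes from.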
Jingyu Liu
NVIDIA; affiliated with University of Chicago. Work done during Jingyu Liu’s internship at NVIDIA.
Xin Dong
NVIDIA
Zhifan Ye
NVIDIA; affiliated with Georgia Institute of Technology. Work done during Zhifan Ye’s internship at NVIDIA.
Rishabh Mehta
NVIDIA
Yonggan Fu
NVIDIA Research
Efficient AI, Efficient Language Models, Model Compression
Vartika Singh
NVIDIA
Jan Kautz
Vice President of Research, NVIDIA Research
Computer Vision, Machine Learning, Visual Computing
Ce Zhang
NVIDIA; affiliated with University of Chicago. Work done during Ce Zhang’s internship at NVIDIA.
Pavlo Molchanov
NVIDIA Research
AI, Machine Learning, Efficient Deep Learning, Semi-supervised Learning, Network Inversion