Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

📅 2025-08-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high inference latency of large-scale diffusion language models—hindering real-time code generation—this paper proposes a non-autoregressive, parallel generation framework based on discrete-state diffusion. Departing from conventional token-by-token autoregressive decoding, our approach models latent states discretely and enables full-sequence parallel sampling, substantially improving inference throughput. On an H20 GPU, it achieves 2,146 tokens/sec—the fastest reported inference speed for diffusion-based code generation models to date—and is the first to exceed 2,000 tokens/sec. It maintains state-of-the-art functional correctness on major code-generation benchmarks (HumanEval, MBPP), establishing a new speed–quality Pareto frontier. Compared to Mercury and Gemini Diffusion under identical hardware conditions, our method delivers significantly higher throughput while preserving or improving generation quality.

Technology Category

Application Category

📝 Abstract
We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 token/s over H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing new state of the art on the speed-quality Pareto frontier for code models.
Problem

Research questions and friction points this paper is trying to address.

Improving inference speed in large-scale diffusion language models
Enhancing parallel generation for faster token decoding
Optimizing speed-quality trade-off in code generation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete-state diffusion for fast inference
Parallel generation reduces decoding latency
Achieves 2146 tokens/s on H20 GPUs
Yuxuan Song
Yuxuan Song
Tsinghua University
Deep Generative ModelsLLM4Science,
Z
Zheng Zhang
Institute for AI Industry Research (AIR), Tsinghua University
C
Cheng Luo
ByteDance Seed
P
Pengyang Gao
ByteDance Seed
F
Fan Xia
ByteDance Seed
H
Hao Luo
ByteDance Seed
Z
Zheng Li
ByteDance Seed
Y
Yuehang Yang
ByteDance Seed
H
Hongli Yu
ByteDance Seed
X
Xingwei Qu
ByteDance Seed
Yuwei Fu
Yuwei Fu
McGill Univeristy
reinforcement learning
J
Jing Su
ByteDance Seed
G
Ge Zhang
ByteDance Seed
W
Wenhao Huang
ByteDance Seed
M
Mingxuan Wang
ByteDance Seed
L
Lin Yan
ByteDance Seed
X
Xiaoying Jia
ByteDance Seed
J
Jingjing Liu
ByteDance Seed
Wei-Ying Ma
Wei-Ying Ma
Tsinghua University
Generative AI and Large Language Models (LLMs) for Science
Y
Ya-Qin Zhang
Institute for AI Industry Research (AIR), Tsinghua University
Y
Yonghui Wu
ByteDance Seed
H
Hao Zhou
Institute for AI Industry Research (AIR), Tsinghua University