Multi-Token Residual Prediction

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion language models lose generation quality as more tokens are decoded in parallel per step, and obtaining next-step logits normally requires another backbone forward pass. This work proposes MRP, a lightweight module that exploits the high similarity of logit distributions between adjacent denoising steps to enable dependency-aware, efficient multi-token denoising. Rather than rerunning the backbone, MRP predicts the logit residual between steps from the hidden states of a single backbone forward pass, which reduces redundant computation. Applied to SDAR models, the method supports both direct and speculative decoding modes. Experiments on models at the 1.7B, 4B, and 8B scales across reasoning and code generation benchmarks show up to a 1.42× lossless speedup with speculative decoding.
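The residual idea in the summary can be illustrated with a toy sketch. It assumes a single linear residual head applied to one position's hidden state; the head's real architecture is not described here, so `predict_residual`, `corrected_logits`, the weight matrix, and all numbers below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_residual(hidden, weight):
    """Hypothetical linear head: maps a hidden vector to one logit delta per vocab entry."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def corrected_logits(step_logits, hidden, weight):
    """Approximate next-step logits as current logits plus the predicted residual.

    This stands in for a second backbone forward pass (assumption: additive residual).
    """
    residual = predict_residual(hidden, weight)
    return [logit + delta for logit, delta in zip(step_logits, residual)]


# Toy example: vocabulary of 3 tokens, hidden size 2 (made-up values).
step_logits = [2.0, 1.0, 0.0]
hidden = [1.0, -1.0]
weight = [[0.0, 0.5], [0.5, 0.0], [0.0, 0.0]]

next_logits = corrected_logits(step_logits, hidden, weight)
next_probs = softmax(next_logits)
```

Because adjacent steps produce similar logits, the residual is expected to be small, which is what would make a cheap head a plausible substitute for a full backbone pass.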

📝 Abstract

Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We deploy MRP in two inference modes: direct decoding, which uses the corrected logits without verification for a tunable quality-speed tradeoff; and speculative decoding, which verifies MRP's proposals against the backbone for lossless acceleration. Experiments on SDAR models at the 1.7B, 4B, and 8B scales across reasoning and code generation benchmarks demonstrate up to 1.42× lossless speedup in SGLang.
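The confidence-threshold decoding and the speculative verification step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the exact-match acceptance rule is an assumption (the abstract does not state the acceptance criterion), and the function names and toy probabilities are hypothetical.

```python
def select_by_confidence(probs_per_position, masked, threshold):
    """Unmask every masked position whose top-token probability meets the threshold.

    Returns a dict {position: token_id}. A lower threshold decodes more tokens per step.
    """
    chosen = {}
    for pos in masked:
        probs = probs_per_position[pos]
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= threshold:
            chosen[pos] = token
    return chosen


def verify_proposals(proposals, backbone_tokens):
    """Speculative mode (assumed exact-match rule): keep a proposal only if the
    backbone picks the same token at that position."""
    return {
        pos: tok
        for pos, tok in proposals.items()
        if backbone_tokens.get(pos) == tok
    }


# Toy distributions over a 2-token vocabulary at three masked positions.
probs = {0: [0.9, 0.1], 1: [0.55, 0.45], 2: [0.2, 0.8]}

# Direct mode: accept confident predictions without verification.
direct = select_by_confidence(probs, masked=[0, 1, 2], threshold=0.75)

# Speculative mode: the backbone's own choices decide which proposals survive.
backbone_tokens = {0: 0, 2: 0}
accepted = verify_proposals(direct, backbone_tokens)
```

In this toy run, direct mode commits positions 0 and 2, while speculative mode keeps only position 0 because the backbone disagrees at position 2. That difference is the tunable tradeoff versus lossless behavior the abstract describes.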

Problem

Research questions and friction points this paper is trying to address.

Diffusion Language Models

multi-token denoising

quality degradation

parallel text generation

token decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-token Residual Prediction

Diffusion Language Models

Speculative Decoding