🤖 AI Summary
This work addresses a key limitation of existing diffusion language models (dLLMs): they use high confidence thresholds as a proxy for conditional independence, which limits how far parallel decoding can scale and delays the use of tokens that have already converged to the correct prediction in early denoising steps. To overcome this, the authors propose LEAP, a training-free, plug-and-play method. LEAP uses lookahead context filtering and multi-sequence aggregation to identify converged tokens early in the denoising process, and verifies that early convergence aligns with token correctness. Empirically, LEAP reduces the average number of denoising steps by approximately 30% across multiple benchmarks relative to confidence-based decoding. When combined with dParallel on GSM8K, it achieves a decoding throughput of 7.2 tokens per step while preserving model accuracy.
📝 Abstract
Diffusion Language Models (dLLMs) have garnered significant attention for their potential in highly parallel processing. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions. However, the stringent confidence thresholds required to preserve accuracy severely constrain the scalability of parallelism. Through systematic token-level statistical analysis, we reveal that a substantial proportion of tokens converge to their correct predictions early in the denoising process yet fail to reach standard confidence thresholds, confirming that current confidence-based criteria are overly conservative. In response, we introduce LEAP (Lookahead Early-Convergence Token Detection for Accelerated Parallel Decoding). LEAP is a training-free, plug-and-play method that leverages future context filtering and multi-sequence superposition to detect early-converging tokens. By validating the alignment between early convergence and correctness, we enable reliable early decoding of these tokens. Benchmarking across diverse domains demonstrates that LEAP significantly reduces both inference latency and the number of decoding steps. Compared to confidence-based decoding, the average number of denoising steps is reduced by about 30%. On the GSM8K dataset, combining LEAP with dParallel accelerates decoding to 7.2 tokens per step while preserving model accuracy. LEAP effectively breaks the reliance on high-confidence priors, offering a novel paradigm for parallel decoding.
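To make the gap between "confident" and "converged" concrete, here is a minimal toy sketch in plain Python. It is not LEAP's detection rule, which the abstract describes only as future context filtering plus multi-sequence superposition. The stability check below (the top-1 token staying unchanged over a window of denoising steps) is a simplified stand-in, and the function names, the `threshold` and `window` parameters, and the probability values are all illustrative assumptions.

```python
from typing import List

# One probability row per position; one list of rows per denoising step.
StepProbs = List[List[float]]


def argmax(row: List[float]) -> int:
    """Index of the largest probability in a row."""
    return max(range(len(row)), key=row.__getitem__)


def confident_positions(probs: StepProbs, masked: List[int],
                        threshold: float = 0.9) -> List[int]:
    """Confidence-based criterion: unmask positions whose top-1
    probability at the current step meets the threshold."""
    return [i for i in masked if max(probs[i]) >= threshold]


def stable_positions(history: List[StepProbs], masked: List[int],
                     window: int = 3) -> List[int]:
    """Hypothetical convergence proxy (an assumption, not LEAP itself):
    positions whose top-1 token is identical over the last `window` steps."""
    if len(history) < window:
        return []
    recent = history[-window:]
    return [i for i in masked
            if len({argmax(step[i]) for step in recent}) == 1]


# Toy trajectory: 3 masked positions, vocabulary of 3 tokens, 3 steps.
history = [
    [[0.95, 0.03, 0.02], [0.20, 0.60, 0.20], [0.40, 0.35, 0.25]],
    [[0.96, 0.02, 0.02], [0.15, 0.65, 0.20], [0.30, 0.45, 0.25]],
    [[0.97, 0.02, 0.01], [0.10, 0.70, 0.20], [0.25, 0.30, 0.45]],
]
masked = [0, 1, 2]

by_confidence = confident_positions(history[-1], masked)
by_stability = stable_positions(history, masked)
print(by_confidence, by_stability)
```

In this toy trajectory, position 1 keeps the same top-1 token at every step but never exceeds a probability of 0.70, so a 0.9 threshold leaves it masked while the stability proxy selects it. Position 2 changes its top-1 token at each step and is selected by neither criterion.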