🤖 AI Summary
Diffusion language models suffer from semantic drift, repetition, and incoherence in long-text generation due to contextual decay induced by large decoding windows. To address this, we propose a convolutional normalized decoding mechanism that captures long-range dependencies via localized receptive fields—enabling effective context compression without chunking—and a rejection-based rule fine-tuning strategy that imposes explicit semantic consistency constraints during post-training. Together, these components enhance contextual fidelity and fluency in distant-token generation. Experiments demonstrate state-of-the-art performance on open-generation benchmarks (e.g., AlpacaEval), with a 37% reduction in generation steps, 2.1× inference speedup, and significant improvements in coherence and relevance.
📝 Abstract
Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions like semi-autoregressive address this issue by splitting windows into blocks, but this sacrifices speed and bidirectionality, eliminating the main advantage of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements.