🤖 AI Summary
Diffusion language models suffer from uncontrolled and inconsistent generation because their denoising process lacks precise probabilistic estimates. To address this, we propose entropy-driven token-weighted supervised fine-tuning (SFT): uncertainty, quantified via information entropy during the diffusion process, is used to dynamically identify and weight critical tokens, giving explicit control over generation trajectories. The method requires no architectural modifications; it simply applies an entropy-based weighting scheme to the loss function, enabling efficient adaptation from small datasets. Trained on the s1K and s1K-1.1 datasets and a 3K-sample subset of open-r1, and evaluated on four reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500), it yields relative improvements of up to 83% over standard SFT, boosting both generation consistency and task accuracy. The core innovation is introducing information entropy into supervised fine-tuning for diffusion language models, enabling the first uncertainty-aware, fine-grained, token-level control mechanism.
📝 Abstract
Diffusion models have recently shown strong potential in language modeling, offering faster generation than traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models in which tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: trained on s1K, s1K-1.1, and 3K samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.
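As a rough illustration of the idea only, and not the paper's exact formulation, an entropy-weighted token loss can be sketched as follows. The weighting function `1 + alpha * entropy` and the normalization step are hypothetical stand-ins for whatever weighting WeFT derives from diffusion theory:

```python
import numpy as np

def token_entropy(probs):
    # Shannon entropy (in nats) of each token's predicted distribution
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def entropy_weighted_sft_loss(probs, targets, alpha=1.0):
    """Entropy-weighted cross-entropy over a token sequence (illustrative sketch).

    probs:   (T, V) model distributions at each token position
    targets: (T,)   ground-truth token ids
    alpha:   hypothetical knob controlling how strongly high-entropy
             (uncertain) tokens are up-weighted; alpha=0 recovers plain SFT
    """
    T = probs.shape[0]
    # Per-token cross-entropy against the ground-truth token
    ce = -np.log(probs[np.arange(T), targets] + 1e-12)
    # Up-weight tokens where the model is uncertain
    w = 1.0 + alpha * token_entropy(probs)
    # Normalize so the overall loss scale stays comparable to plain SFT
    w = w / w.mean()
    return float(np.mean(w * ce))
```

With `alpha=0` the weights are uniform and the loss reduces to standard token-averaged cross-entropy; larger `alpha` shifts gradient mass toward the uncertain tokens that most influence the generation trajectory.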