🤖 AI Summary
Diffusion language models suffer from uncontrolled and inconsistent generation because their denoising process lacks precise probabilistic estimates. To address this, we propose entropy-driven token-weighted supervised fine-tuning (SFT): uncertainty, quantified via information entropy during the diffusion process, is used to dynamically identify and weight critical tokens, giving explicit control over generation trajectories. The method requires no architectural modifications; it simply applies an entropy-based weighting scheme to the loss function, enabling efficient adaptation from small datasets. Trained on the s1K and s1K-1.1 datasets and a 3K-sample subset of open-r1, and evaluated on four reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500), it yields relative improvements of up to 83% over standard SFT, boosting both generation consistency and task accuracy. The core innovation is introducing information entropy into supervised fine-tuning for diffusion language models, enabling the first uncertainty-aware, fine-grained, token-level control mechanism.
📝 Abstract
Diffusion models have recently shown strong potential in language modeling, offering faster generation than traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models in which tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: trained on s1K, s1K-1.1, and 3K samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.
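As a rough illustration of the idea only, and not the paper's exact formulation, an entropy-weighted token loss can be sketched as follows. The weighting function `1 + alpha * entropy` and the normalization step are hypothetical stand-ins for whatever weighting WeFT derives from diffusion theory:

```python
import numpy as np

def token_entropy(probs):
    # Shannon entropy (in nats) of each token's predicted distribution
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def entropy_weighted_sft_loss(probs, targets, alpha=1.0):
    """Entropy-weighted cross-entropy over a token sequence (illustrative sketch).

    probs:   (T, V) model distributions at each token position
    targets: (T,)   ground-truth token ids
    alpha:   hypothetical knob controlling how strongly high-entropy
             (uncertain) tokens are up-weighted; alpha=0 recovers plain SFT
    """
    T = probs.shape[0]
    # Per-token cross-entropy against the ground-truth token
    ce = -np.log(probs[np.arange(T), targets] + 1e-12)
    # Up-weight tokens where the model is uncertain
    w = 1.0 + alpha * token_entropy(probs)
    # Normalize so the overall loss scale stays comparable to plain SFT
    w = w / w.mean()
    return float(np.mean(w * ce))
```

With `alpha=0` the weights are uniform and the loss reduces to standard token-averaged cross-entropy; larger `alpha` shifts gradient mass toward the uncertain tokens that most influence the generation trajectory.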