🤖 AI Summary
Diffusion language models suffer from a training–inference mismatch under standard supervised fine-tuning: during training, they reconstruct randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step denoising trajectory that progresses from easy to hard tokens. This discrepancy makes fine-tuning inefficient and limits gains in the model's underlying capability. To address this, the work proposes the TABOM framework, which uses self-distilled inference trajectories not merely for inference acceleration but for capability enhancement. TABOM models the unmasking preference at inference as a Boltzmann distribution over predictive entropies and derives from it a tractable pairwise ranking loss that aligns the model's certainty ordering with the observed decoding trajectory. Compared with standard supervised fine-tuning, this approach significantly mitigates catastrophic forgetting, expands the model's effective knowledge boundary, and yields substantial gains in new domains.
📝 Abstract
Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning (SFT) remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
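To make the idea of a Boltzmann distribution over predictive entropies and a pairwise ranking objective more concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes that the preference for unmasking position i is proportional to exp(-H_i / tau), where H_i is the predictive entropy at that position, and that restricting this preference to a pair of positions gives a logistic loss on the entropy gap. The temperature `tau`, the function names, the `unmask_step` encoding of the trajectory, and the logistic form itself are all illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one position's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def boltzmann_unmask_probs(entropies, tau=1.0):
    """Assumed Boltzmann preference over masked positions: p_i ∝ exp(-H_i / tau)."""
    lowest = min(entropies)  # shift by the minimum so every exponent is <= 0
    weights = [math.exp(-(h - lowest) / tau) for h in entropies]
    total = sum(weights)
    return [w / total for w in weights]


def pairwise_ranking_loss(entropies, unmask_step, tau=1.0):
    """Mean of -log sigmoid((H_late - H_early) / tau) over trajectory-ordered pairs.

    unmask_step[i] is the (hypothetical) decoding step at which position i
    was unmasked in the self-distilled trajectory. Positions unmasked
    earlier are pushed toward lower entropy than positions unmasked later.
    """
    total, count = 0.0, 0
    for i, step_i in enumerate(unmask_step):
        for j, step_j in enumerate(unmask_step):
            if step_i < step_j:  # position i was decoded before position j
                margin = (entropies[j] - entropies[i]) / tau
                total += softplus(-margin)  # equals -log sigmoid(margin)
                count += 1
    return total / count if count else 0.0


# Toy example: three masked positions with increasingly flat predictions.
H = [
    entropy([0.97, 0.01, 0.01, 0.01]),  # confident, "easy" token
    entropy([0.70, 0.10, 0.10, 0.10]),
    entropy([0.25, 0.25, 0.25, 0.25]),  # uniform, "hard" token
]
aligned = pairwise_ranking_loss(H, unmask_step=[0, 1, 2])
reversed_order = pairwise_ranking_loss(H, unmask_step=[2, 1, 0])
```

In this toy setup the loss is smaller when the trajectory unmasks low-entropy positions first (`aligned`) than when it unmasks them last (`reversed_order`), which is the sense in which such an objective would align the model's certainty ordering with an easy-to-hard decoding trajectory.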