🤖 AI Summary
Discrete diffusion models perform strongly in language modeling, yet it remains unclear in what order they learn the support of the data and its frequency information during denoising. This work shows, theoretically and empirically, that such models first learn the support (for example, which sentences are grammatically valid) and only later refine the frequency distribution within it. In the small-noise regime, each single-token reverse edit decomposes into a leading-order scale, determined by whether the edit moves toward the support, and a finer coefficient that captures frequency information. The decomposition depends on the mechanism: uniform diffusion splits edits into validity-improving, validity-preserving, and validity-worsening classes, whereas absorbing diffusion concentrates its leading-order mass on validity-improving moves. Asymptotic analysis, together with experiments on a masked language diffusion model and synthetic regular-language tasks, supports these predictions: support localization emerges before within-support frequency ranking, and the contrast between the two diffusion types matches the predicted rate separation.
📝 Abstract
Discrete diffusion models are increasingly competitive for language modeling, yet it remains unclear how their denoising objectives organize learning. Although these objectives target the full data distribution, we show that the exact reverse process induces a hierarchy between coarse support information and finer frequency information. For uniform and absorbing (a.k.a. masking) diffusion, we prove that, in the small-noise regime of the final denoising steps, each single-token reverse edit decomposes into a leading scale, determined by whether it moves toward the data support (e.g., grammatically valid sentences), and a finer coefficient, determining relative probabilities within the same scale. Thus, recovering validity structure only requires learning the correct order of magnitude of reverse probabilities, whereas recovering data frequencies requires coefficient-level estimation. The separation is mechanism-dependent: uniform diffusion exhibits a trichotomy into validity-improving, validity-preserving, and validity-worsening edits, while absorbing diffusion places its leading-order mass on validity-improving moves. Experiments on a masked language diffusion model and synthetic regular-language tasks support these predictions: support-localization emerges earlier than within-support frequency ranking, and the contrast between uniform and absorbing diffusion matches the predicted rate separation. Together, our results suggest that discrete diffusion models learn data support before data frequencies.
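The scale-versus-coefficient split for uniform diffusion can be illustrated on a toy problem. The sketch below is not the paper's construction: it assumes a two-token sequence over a three-symbol alphabet, a hypothetical data distribution `DATA`, and a forward corruption that independently replaces each token with a uniformly chosen different symbol with probability `eps`. It uses the marginal ratio q_eps(x') / q_eps(x) as a stand-in for the reverse probability of the single-token edit x -> x'; the paper's exact normalization may differ.

```python
import math

K = 3                                # alphabet {0, 1, 2} (toy assumption)
DATA = {(0, 0): 0.7, (1, 1): 0.3}    # hypothetical data law; its keys form the support

def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(u != v for u, v in zip(a, b))

def noisy_marginal(x, eps):
    """Exact q_eps(x): each token is kept w.p. 1 - eps, otherwise
    replaced uniformly by one of the K - 1 other symbols."""
    n = len(x)
    return sum(
        p * (1 - eps) ** (n - hamming(x, y)) * (eps / (K - 1)) ** hamming(x, y)
        for y, p in DATA.items()
    )

def ratio(x, x_new, eps):
    """Marginal ratio q_eps(x_new) / q_eps(x) for the edit x -> x_new."""
    return noisy_marginal(x_new, eps) / noisy_marginal(x, eps)

def scale_exponent(x, x_new, eps_lo=1e-4, eps_hi=1e-3):
    """Log-log slope a such that ratio ~ c * eps**a at small noise."""
    num = math.log(ratio(x, x_new, eps_hi) / ratio(x, x_new, eps_lo))
    return num / math.log(eps_hi / eps_lo)

def coefficient(x, x_new, eps=1e-4):
    """Prefactor c left after dividing out the leading power of eps."""
    a = round(scale_exponent(x, x_new))
    return ratio(x, x_new, eps) / eps ** a

EDITS = {
    "improving  (0,1)->(0,0)": ((0, 1), (0, 0)),
    "improving  (0,1)->(1,1)": ((0, 1), (1, 1)),
    "preserving (0,1)->(0,2)": ((0, 1), (0, 2)),
    "worsening  (0,0)->(0,1)": ((0, 0), (0, 1)),
}
for label, (x, x_new) in EDITS.items():
    a = scale_exponent(x, x_new)
    c = coefficient(x, x_new)
    print(f"{label}: exponent {a:+.3f}, coefficient {c:.3f}")
```

In this toy setting the fitted exponents fall close to -1, -1, 0, and +1, so the power of `eps` alone separates improving, preserving, and worsening edits. The two improving edits share the same exponent and differ only in their coefficients, whose ratio approaches 0.7 / 0.3, the relative frequency of the two support points. This is the sense in which validity needs only the order of magnitude while frequencies need the coefficient.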