🤖 AI Summary
This work addresses the lack of theoretical understanding of the information-generation mechanism in the reverse process of discrete diffusion models, a gap that leads to inefficient sampling. Viewing the reverse process through the lens of thermodynamic entropy production, the study introduces, for the first time, the entropy production rate and the Wasserstein distance into the design of discrete diffusion sampling schedules. It proposes two physically inspired strategies: the Entropic Discrete Schedule (EDS), which maintains a constant rate of information gain, and the Wasserstein Discrete Schedule (WDS), which takes equal steps in Wasserstein distance. Evaluated across diverse tasks, including synthetic data generation, symbolic music modeling, vision, and language modeling, the proposed schedules consistently outperform state-of-the-art sampling strategies at significantly lower computational cost.
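The summary does not spell out what "entropy production rate" means in this setting. For reference only (the paper's exact definition may differ), the standard Schnakenberg form for a continuous-time Markov chain, the usual formalism behind discrete diffusion, with transition rates $w_{ij}$ (from state $j$ to state $i$) and occupation probabilities $p_i(t)$ is:

```latex
% Schnakenberg entropy production rate for a continuous-time Markov
% chain; a reference formula, not necessarily the paper's definition.
\dot{S}(t) \;=\; \frac{1}{2} \sum_{i \neq j}
  \bigl( w_{ij}\, p_j(t) - w_{ji}\, p_i(t) \bigr)
  \ln \frac{w_{ij}\, p_j(t)}{w_{ji}\, p_i(t)} \;\ge\; 0 .
```

The rate is non-negative and vanishes exactly at detailed balance, which is what makes it a natural proxy for how much "information work" each reverse step performs.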
📝 Abstract
Discrete diffusion models have emerged as a powerful paradigm for generative modeling of sequence data; however, the information-theoretic principles governing their reverse processes remain significantly less understood than those of their continuous counterparts. In this work, we bridge this gap by analyzing the reverse-process dynamics through the lens of thermodynamic entropy production. We propose the entropy production rate as a rigorous proxy for quantifying information generation, deriving as a byproduct a bound on the Wasserstein distance between intermediate states and the data distribution. Leveraging these insights, we introduce two novel sampling schedules that are uniformly spaced with respect to their corresponding physics-inspired metrics: the Entropic Discrete Schedule (EDS), defined by maintaining a constant rate of information gain, and the Wasserstein Discrete Schedule (WDS), defined by taking equal steps in Wasserstein distance. We empirically demonstrate that our proposed schedules significantly outperform state-of-the-art strategies across diverse application domains, including synthetic data, music notation, vision, and language modeling, consistently achieving superior performance at a lower computational budget.
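Both schedules share one construction principle: choose timesteps so that a cumulative, monotone metric (total entropy produced for EDS, the Wasserstein-distance bound for WDS) advances by an equal increment per step. Below is a minimal sketch of that inversion, assuming a user-supplied cumulative metric; the function and variable names (`metric_uniform_schedule`, `cum_metric`) are illustrative, not from the paper:

```python
import numpy as np

def metric_uniform_schedule(cum_metric, num_steps, t_grid=None):
    """Pick timesteps so a cumulative metric grows by equal increments.

    cum_metric : vectorized callable mapping t in [0, 1] to a
        monotonically increasing cumulative quantity (e.g., total
        entropy produced up to t, or the Wasserstein bound up to t).
    num_steps  : number of sampling steps to allocate.
    t_grid     : optional dense grid on which to evaluate cum_metric.
    """
    if t_grid is None:
        t_grid = np.linspace(0.0, 1.0, 1001)
    m = cum_metric(t_grid)
    # Normalize to [0, 1] so the targets are equal fractions of the total.
    m = (m - m[0]) / (m[-1] - m[0])
    targets = np.linspace(0.0, 1.0, num_steps + 1)
    # Invert the monotone cumulative metric by linear interpolation.
    return np.interp(targets, m, t_grid)

# Toy example: a hypothetical metric that grows fast near t = 0.
schedule = metric_uniform_schedule(lambda t: np.sqrt(t), num_steps=8)
print(schedule)  # non-uniform timesteps, equally spaced in the metric
```

In this toy example the metric changes fastest near t = 0, so the resulting schedule concentrates steps there, which is exactly the behavior a metric-uniform schedule is meant to produce instead of spacing steps uniformly in time.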