Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

📅 2024-06-06

🏛️ arXiv.org

📈 Citations: 9

✨ Influential: 2

🤖 AI Summary

This work addresses the cost of estimating the concrete score, the ratio between the marginal probabilities of two states, in absorbing discrete diffusion models. The authors show that the concrete score can be expressed as a conditional probability of clean data multiplied by a time-dependent scalar available in closed form. Building on this result, they propose RADD (reparameterized absorbing discrete diffusion), a model without time conditioning that characterizes the time-independent conditional probabilities. Because the network output does not depend on the timestep, it can be cached whenever the noisy sample stays unchanged over a sampling interval, which reduces the number of function evaluations (NFEs) and accelerates sampling. On the theory side, the work unifies absorbing discrete diffusion with any-order autoregressive models (AO-ARMs): the upper bound on the diffusion model's negative log-likelihood can be interpreted as an expected negative log-likelihood for AO-ARMs. Empirically, RADD achieves state-of-the-art perplexity among diffusion models at the GPT-2 scale on five zero-shot language modeling benchmarks. Code is publicly available.
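
The factorization of the concrete score can be checked with a short calculation. The sketch below is not taken from the paper's text. It assumes a sequence of length d in which each token is masked independently by time t with probability 1 - e^{-\bar\sigma(t)}, where \bar\sigma(t) denotes a cumulative noise level (the notation is an assumption). Here x_t has m masked positions, x_t^{UM} denotes its unmasked tokens, and \hat{x}_t equals x_t except that position i, masked in x_t, holds the token \hat{x}^i.

```latex
\begin{align*}
p_t(x_t) &= \bigl(1 - e^{-\bar\sigma(t)}\bigr)^{m}\,
            \bigl(e^{-\bar\sigma(t)}\bigr)^{d-m}\,
            p_0\bigl(x_t^{\mathrm{UM}}\bigr), \\
p_t(\hat{x}_t) &= \bigl(1 - e^{-\bar\sigma(t)}\bigr)^{m-1}\,
            \bigl(e^{-\bar\sigma(t)}\bigr)^{d-m+1}\,
            p_0\bigl(x_t^{\mathrm{UM}}, \hat{x}^{i}\bigr), \\
\frac{p_t(\hat{x}_t)}{p_t(x_t)} &=
            \frac{e^{-\bar\sigma(t)}}{1 - e^{-\bar\sigma(t)}}\;
            p_0\bigl(\hat{x}^{i} \mid x_t^{\mathrm{UM}}\bigr).
\end{align*}
```

Under these assumptions, the ratio splits into a scalar that depends only on t and a conditional probability of clean data that does not depend on t, which is the quantity a time-independent network can model.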

📝 Abstract

Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval, which enables sampling acceleration. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale. Our code is available at https://github.com/ML-GSAI/RADD.
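
The caching idea in the abstract can be illustrated with a toy sampler. This is a minimal sketch, not the RADD implementation: `CountingNetwork`, `sample_with_cache`, the `MASK` id, the uniform "model" output, and the linear unmasking schedule are all hypothetical choices made for illustration. The only point it demonstrates is that a network without a time input needs to be re-evaluated only after the noisy sample changes.

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] state


class CountingNetwork:
    """Toy stand-in for a time-independent denoiser; counts forward passes."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.calls = 0

    def __call__(self, tokens):
        self.calls += 1
        # Uniform conditionals for illustration only; a trained model would
        # condition on the unmasked tokens. Note there is no time argument.
        return [[1.0 / self.vocab_size] * self.vocab_size for _ in tokens]


def sample_with_cache(net, length, num_steps, rng):
    """Reverse sampling that reuses the network output while tokens are unchanged."""
    tokens = [MASK] * length
    cached_probs = None  # valid only while `tokens` has not changed
    for step in range(num_steps, 0, -1):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        if cached_probs is None:
            cached_probs = net(tokens)  # one function evaluation (NFE)
        # Assumed linear schedule: moving from time step/N to (step-1)/N,
        # each masked token is revealed with probability 1/step.
        reveal_prob = 1.0 / step
        changed = False
        for i in masked:
            if rng.random() < reveal_prob:
                tokens[i] = rng.choices(range(net.vocab_size), weights=cached_probs[i])[0]
                changed = True
        if changed:
            cached_probs = None  # the sample changed, so the cache is stale
    return tokens


rng = random.Random(0)
net = CountingNetwork(vocab_size=4)
sample = sample_with_cache(net, length=8, num_steps=64, rng=rng)
print(sample, "network calls:", net.calls)  # at most 8 calls for 64 steps
```

In this sketch the number of network calls is bounded by the sequence length rather than by the number of sampling steps, because each call is followed by at least one token being revealed before the next call.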

Problem

Research questions and friction points this paper is trying to address.

Estimating the concrete score (marginal probability ratios) at all timesteps

Reduces function evaluations in diffusion models

Unifies discrete diffusion with autoregressive models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reparameterized absorbing discrete diffusion (RADD)

Time-independent conditional probabilities

Sampling acceleration via caching
