🤖 AI Summary
Cascade diffusion models suffer from intractable marginalization across stages, preventing exact likelihood computation and limiting performance in density estimation, lossless compression, and out-of-distribution (OOD) detection. This work introduces the first analytically tractable cascade diffusion model grounded in hierarchical volume-preserving transformations—specifically Laplacian pyramids and wavelet transforms—that decompose images losslessly into multiscale latent variables, thereby circumventing marginalization entirely. Theoretically, the authors establish the first equivalence between this framework and score matching under the Earth Mover’s Distance (EMD), unifying perceptual quality modeling with statistical density estimation. Empirically, the model achieves state-of-the-art performance across all three benchmarks—density estimation, lossless compression, and OOD detection—while enabling high-resolution synthesis and exact probabilistic inference.
📝 Abstract
Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. We resolve this issue by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. We demonstrate that two such maps are well-known in the multiscale modeling literature: Laplacian pyramids and wavelet transforms. Not only do such reparameterizations allow the likelihood function to be directly expressed as a joint likelihood over the scales, we show that the Laplacian pyramid and wavelet transform also produce significant improvements over the state-of-the-art on a selection of benchmarks in likelihood modeling, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains, we uncover deep connections to score matching under the Earth Mover's Distance (EMD), a well-known surrogate for perceptual similarity. Code can be found at https://github.com/lihenryhfl/pcdm.
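The key property the abstract relies on is that the multiscale decomposition is exactly invertible, so no information is lost and the likelihood factorizes over the scales instead of requiring marginalization. The sketch below illustrates this losslessness for a toy Laplacian pyramid; it is not the authors' implementation (which uses volume-preserving variants), and the simple 2×2 block-average downsampling and nearest-neighbor upsampling here are stand-ins for the usual Gaussian filters.

```python
import numpy as np

def laplacian_pyramid(x, levels=3):
    """Decompose a single-channel image into residuals plus a coarse base.

    Toy version: downsample by 2x2 block averaging, upsample by
    nearest-neighbor repetition. Each level stores the residual between
    the image and the upsampled coarse image, so the decomposition is
    exactly invertible (lossless).
    """
    pyramid = []
    for _ in range(levels):
        h, w = x.shape
        # Coarse image: average over non-overlapping 2x2 blocks.
        coarse = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        # Upsample back to the original resolution.
        up = coarse.repeat(2, axis=0).repeat(2, axis=1)
        pyramid.append(x - up)   # high-frequency residual at this scale
        x = coarse
    pyramid.append(x)            # lowest-resolution base image
    return pyramid

def invert_pyramid(pyramid):
    """Reconstruct the original image exactly from its pyramid."""
    x = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        x = residual + x.repeat(2, axis=0).repeat(2, axis=1)
    return x
```

Because `invert_pyramid(laplacian_pyramid(x)) == x` up to floating-point error, a density model placed on the pyramid coefficients assigns an exact (joint) likelihood to the image itself, which is the property the paper exploits; note the standard Laplacian pyramid is overcomplete, which is part of what the paper's volume-preserving maps address.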