🤖 AI Summary
While diffusion models excel at image and video generation, they struggle to capture high-level semantic conditional dependencies, such as physical laws or object stability, because they model the data only implicitly and globally.
Method: We study autoregressive diffusion models (AR-DM), which explicitly encode conditional dependencies among data tokens by generating tokens autoregressively, each via a diffusion process conditioned on the previously generated tokens.
Contribution/Results: Theoretically, we derive the first sampling error upper bound for AR diffusion models under (possibly) the mildest data assumption, and show that, relative to standard DDPM, the AR variant approximates the data's conditional distributions with a smaller gap; the presence or absence of conditional dependence structure in the data emerges as the key factor governing the performance gap. Empirically, AR-DM significantly outperforms DDPM on datasets with explicit conditional dependence structure, performs on par with DDPM when such structure is absent, and incurs only moderate inference overhead. This work provides a theoretical foundation for strengthening the high-level semantic modeling capability of diffusion models.
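The AR sampling scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise_step` is a hypothetical stand-in for a learned noise predictor, and the token count, step count, and conditioning rule are arbitrary choices for the sketch. The point it shows is structural: each token starts from pure noise and runs its own reverse diffusion, but the denoiser for token *k* sees all previously generated tokens as context, which is where the explicit conditional dependence enters.

```python
import numpy as np

def denoise_step(x_t, t, context):
    # Hypothetical denoiser standing in for a learned score/noise model.
    # For illustration it simply contracts x_t toward the mean of the
    # conditioning context (0 when no tokens have been generated yet).
    target = context.mean() if context.size else 0.0
    return x_t + 0.5 * (target - x_t)

def ar_diffusion_sample(num_tokens=4, num_steps=10, token_dim=2, seed=0):
    """Generate tokens one at a time; each token's reverse diffusion is
    conditioned on all previously generated tokens (the AR structure)."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(num_tokens):
        # Conditioning context: everything generated so far.
        context = np.concatenate(tokens) if tokens else np.empty(0)
        x = rng.standard_normal(token_dim)    # start from pure noise
        for t in reversed(range(num_steps)):  # reverse diffusion for this token
            x = denoise_step(x, t, context)
        tokens.append(x)
    return np.stack(tokens)
```

A vanilla DDPM, by contrast, would denoise all tokens jointly in a single reverse process, leaving cross-token dependence implicit; the extra cost of the AR variant is the sequential loop over tokens, which matches the abstract's point that inference time grows only moderately.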
📝 Abstract
Diffusion models have demonstrated appealing performance in both image and video generation. However, many works have found that they struggle to capture important, high-level relationships present in the real world: for example, they fail to learn physical laws from data, and even fail to understand that objects in the world persist in a stable fashion. This is because important conditional dependence structures are not adequately captured by vanilla diffusion models. In this work, we initiate an in-depth study on strengthening diffusion models to capture the conditional dependence structures in the data. In particular, we examine the efficacy of auto-regressive (AR) diffusion models for this purpose and develop the first theoretical results on the sampling error of AR diffusion models under (possibly) the mildest data assumption. Our theoretical findings indicate that, compared with typical diffusion models, the AR variant produces samples with a reduced gap in approximating the data conditional distribution. Moreover, the overall inference time of AR diffusion models is only moderately larger than that of vanilla diffusion models, keeping them practical for large-scale applications. We also provide empirical results showing that when there is a clear conditional dependence structure in the data, AR diffusion models capture it, whereas vanilla DDPM fails to do so. On the other hand, when there is no obvious conditional dependence across patches of the data, AR diffusion does not outperform DDPM.