🤖 AI Summary
Diffusion models incur substantial computational overhead at inference, hindering their use as fast emulators of physical systems. To address this, the authors train diffusion models in the latent space of a learned autoencoder, rather than in pixel space, drastically reducing dimensionality and accelerating inference. Experiments show that emulation accuracy is surprisingly robust across a wide range of compression rates, remaining high even at 1000× compression. Compared with non-generative baselines, the diffusion-based emulators are consistently more accurate and compensate for uncertainty in their predictions with greater diversity. The work also identifies practical design choices, from encoder and diffusion architectures to optimizers and training strategy, that prove critical for training latent-space emulators of dynamical systems.
📝 Abstract
The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.
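The pipeline described above can be sketched end to end: encode a physical state into a low-dimensional latent, run an iterative diffusion-style denoising loop in that latent space to emulate the next state, and decode back to the full state. The sketch below is a minimal toy, not the paper's method: the linear "autoencoder", the oracle `denoiser`, and all dimensions and names are illustrative assumptions, since the abstract does not specify the actual architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 1024   # e.g. a flattened physical field
LATENT_DIM = 16    # 64x compression in this toy (the paper reports up to 1000x)

# A frozen random linear map stands in for the learned autoencoder.
W = rng.standard_normal((STATE_DIM, LATENT_DIM)) / np.sqrt(STATE_DIM)
W_pinv = np.linalg.pinv(W)  # pseudo-inverse serves as the decoder

def encode(x):
    return x @ W            # (..., STATE_DIM) -> (..., LATENT_DIM)

def decode(z):
    return z @ W_pinv       # (..., LATENT_DIM) -> (..., STATE_DIM)

def denoiser(z_noisy, z_prev, t):
    # Stand-in for a learned network that predicts the clean next latent
    # conditioned on the previous latent; here an oracle returning z_prev.
    return z_prev

def emulate_step(z_prev, n_steps=8, noise_scale=0.1):
    # Generative emulation of one transition: start from pure noise in
    # latent space and iteratively denoise toward the predicted next state.
    z = rng.standard_normal(LATENT_DIM)
    for i in range(n_steps, 0, -1):
        t = (i - 1) / n_steps  # noise level anneals from high to zero
        z = denoiser(z, z_prev, t) + noise_scale * t * rng.standard_normal(LATENT_DIM)
    return z

# Roll out a short trajectory entirely in latent space, decoding each step.
x0 = rng.standard_normal(STATE_DIM)
z = encode(x0)
states = []
for _ in range(5):
    z = emulate_step(z)
    states.append(decode(z))
traj = np.stack(states)  # shape: (5, STATE_DIM)
```

Because every denoising iteration operates on `LATENT_DIM` rather than `STATE_DIM` values, the cost of the expensive inner sampling loop shrinks with the compression rate, which is the computational argument the abstract makes.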