🤖 AI Summary
Existing video diffusion models (VDMs) generate high-fidelity dynamic videos but require prohibitively expensive end-to-end training, limiting practical deployment. This paper proposes a discriminator-guided, finetuning-free paradigm that leverages pretrained image diffusion models (e.g., DDPM) to directly synthesize spatiotemporally coherent videos, without architectural modifications or parameter updates. The key contribution is a time-consistency discriminator that imposes gradient-free spatiotemporal constraints during sampling, calibrating uncertainty and controlling bias. The method operates solely through inference-time guidance, drastically reducing computational overhead. Evaluated on an idealized turbulence simulation and a global precipitation dataset, the approach matches fully trained VDMs in temporal consistency while enabling stable century-scale climate simulations at daily time steps.
📄 Abstract
Realistic temporal dynamics are crucial for many video generation, processing, and modelling applications, e.g. in computational fluid dynamics, weather prediction, or long-term climate simulation. Video diffusion models (VDMs) are the current state-of-the-art method for generating highly realistic dynamics. However, training VDMs from scratch is challenging and requires large computational resources, limiting their wider application. Here, we propose a time-consistency discriminator that enables pretrained image diffusion models to generate realistic spatiotemporal dynamics. The discriminator guides the sampling process at inference time and requires no extension or finetuning of the image diffusion model. We compare our approach against a VDM trained from scratch on an idealized turbulence simulation and a real-world global precipitation dataset. Our approach matches the VDM in temporal consistency, shows improved uncertainty calibration and lower biases, and achieves stable centennial-scale climate simulations at daily time steps.
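The paper does not spell out how the discriminator steers sampling. One hypothetical, gradient-free reading of "discriminator-guided sampling" is candidate selection: at each reverse-diffusion step, draw several candidate updates from the frozen image model and keep the one the time-consistency discriminator scores highest against the previous frame. The sketch below illustrates only this control flow; `denoise_step` and `consistency_score` are toy stand-ins, not the paper's model or discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    # Stand-in for one reverse step of a pretrained image diffusion model
    # (a real model would predict and remove noise with a network).
    return 0.9 * x + 0.1 * rng.standard_normal(x.shape)

def consistency_score(prev_frame, frame):
    # Toy proxy for a time-consistency discriminator: higher score for
    # frames that stay close to the previous frame.
    return -np.mean((frame - prev_frame) ** 2)

def guided_sample(prev_frame, steps=10, candidates=4):
    """Gradient-free guidance by candidate selection: at each step, sample
    several candidate updates and keep the highest-scoring one. The image
    model itself is never modified or finetuned."""
    x = rng.standard_normal(prev_frame.shape)
    for t in range(steps):
        cands = [denoise_step(x, t) for _ in range(candidates)]
        x = max(cands, key=lambda c: consistency_score(prev_frame, c))
    return x

prev = np.zeros((8, 8))           # previous video frame
frame = guided_sample(prev)       # next frame, steered toward consistency
```

Rolling `guided_sample` forward frame by frame would yield a video; because guidance acts only at inference time, the same frozen image model can in principle be run for arbitrarily long sequences, which is what makes the centennial-scale simulations in the abstract feasible.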