🤖 AI Summary
Existing video variational autoencoders (VVAEs) struggle to achieve high reconstruction fidelity, temporally coherent latent representations, and diffusability at the same time, which limits generative performance. This work introduces predictive world modeling into the VVAE framework through a unified reconstruction–prediction objective: the encoder sees only a subset of past frames, and the decoder must jointly reconstruct the observed frames and predict the future ones. During training, future frames are randomly dropped so that the latent dynamics are optimized end to end. The approach substantially improves both the temporal consistency and the generative capability of the latent representations, achieving a 34.42 FVD improvement over the Wan2.2 VAE on UCF101, converging 52% faster, and yielding consistent gains on downstream video understanding tasks.
📝 Abstract
Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict the future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
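The predictive reconstruction objective described above can be illustrated with a minimal, dependency-free sketch. It is not the PV-VAE implementation: the function names, the flat-list frame representation, the uniform sampling of the observed prefix length, the equal weighting of the two loss terms, and the toy mean-frame encoder and decoder are all assumptions made for illustration. The abstract does not specify these details, and a real video VAE would also include a learned network and a KL regularizer.

```python
import random


def sample_observed_length(num_frames, min_observed, rng):
    """Sample how many leading (past) frames the encoder may see.

    Uniform sampling is an assumption; randint is inclusive on both ends,
    so occasionally no future frame is dropped at all.
    """
    return rng.randint(min_observed, num_frames)


def mse(frames_a, frames_b):
    """Mean squared error over two equally shaped lists of flat frames."""
    diffs = [(a - b) ** 2
             for fa, fb in zip(frames_a, frames_b)
             for a, b in zip(fa, fb)]
    return sum(diffs) / len(diffs) if diffs else 0.0


def predictive_reconstruction_loss(video, encode, decode, num_observed):
    """Return (reconstruction error, prediction error) for one clip.

    Only the first num_observed frames reach the encoder, but the decoder
    is asked for the full clip, so it has to predict the dropped frames.
    """
    latent = encode(video[:num_observed])
    output = decode(latent, len(video))
    recon = mse(output[:num_observed], video[:num_observed])
    pred = mse(output[num_observed:], video[num_observed:])
    return recon, pred


# Hypothetical stand-ins for a learned encoder and decoder.
def toy_encode(frames):
    """Latent = per-pixel mean over the observed frames."""
    return [sum(pixels) / len(frames) for pixels in zip(*frames)]


def toy_decode(latent, num_frames):
    """Repeat the latent frame for every requested time step."""
    return [list(latent) for _ in range(num_frames)]


rng = random.Random(0)
video = [[rng.random() for _ in range(16)] for _ in range(8)]  # 8 frames, 16 pixels each
k = sample_observed_length(len(video), 1, rng)
recon, pred = predictive_reconstruction_loss(video, toy_encode, toy_decode, k)
loss = recon + pred  # equal weighting is an assumption, not taken from the paper
```

The point of the sketch is the asymmetry between encoder input and decoder target: because the prediction term is computed on frames the encoder never saw, it can only be reduced if the latent carries information about how the clip evolves.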