🤖 AI Summary
Despite their dominance, diffusion models for video generation suffer from challenges in spatiotemporal modeling, high computational cost, the absence of native likelihood estimation, and limited causal predictive capability. This work pioneers the systematic investigation of normalizing flows (NFs) for autoregressive video generation. We propose a global-local invertible architecture coupled with a flow-score matching mechanism to enable end-to-end causal modeling in a spatiotemporal latent space. Additionally, we introduce a causal denoiser and a video-aware Jacobi iterative parallelization strategy to mitigate temporal error accumulation. Experiments demonstrate that our approach achieves state-of-the-art performance in visual fidelity and motion consistency, significantly outperforms diffusion models in sampling efficiency, and supports multimodal generation, including text-to-video and image-to-video. By enabling tractable likelihood evaluation and interpretable latent dynamics, our framework establishes a novel paradigm for building explainable, quantitatively assessable world models.
📝 Abstract
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems rely almost exclusively on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in a spatiotemporal latent space with a global-local architecture that restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive generation with diffusion models. Additionally, we propose flow-score matching, which equips the model with a lightweight causal denoiser to improve video generation consistency in an autoregressive fashion. To improve sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner sequential updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video, as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
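The Jacobi iteration idea mentioned in the abstract can be illustrated with a minimal toy sketch. The paper's actual model and update rule are not shown here; the function `g` below is a hypothetical stand-in for any causal per-position transform (position `t` depends only on positions `< t`). The key property is that the sequential decode is the fixed point of a parallel update, so updating all positions simultaneously from the previous iterate converges to the same result in at most `T` iterations, often fewer, while each iteration is fully parallelizable.

```python
import numpy as np

def g(prefix, u_t):
    # Hypothetical causal update for one position: depends only on the
    # already-decoded prefix plus that position's input noise/latent.
    # Stands in for the model's per-frame transform (an assumption,
    # not the paper's actual network).
    return np.tanh(prefix.sum()) + u_t

def sequential_decode(u):
    # Standard autoregressive decoding: T strictly sequential steps.
    x = np.zeros_like(u)
    for t in range(len(u)):
        x[t] = g(x[:t], u[t])
    return x

def jacobi_decode(u, tol=1e-12):
    # Jacobi-style decoding: update ALL positions in parallel from the
    # previous iterate. Because g is causal, position t is exact after
    # iteration t+1, so this converges in at most T iterations.
    T = len(u)
    x = np.zeros_like(u)
    for k in range(T):
        x_new = np.array([g(x[:t], u[t]) for t in range(T)])
        if np.max(np.abs(x_new - x)) < tol:  # early exit on convergence
            return x_new, k + 1
        x = x_new
    return x, T
```

Both decoders produce identical outputs; the Jacobi variant trades redundant compute per iteration for the ability to batch all positions, which is where the throughput gain comes from on parallel hardware.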