🤖 AI Summary
To address the error accumulation inherent in autoregressive (AR) paradigms for long-term video prediction, this paper proposes the first AR-free diffusion model framework, eliminating frame-by-frame generation and enabling an end-to-end direct mapping from context frame tuples to future frame tuples. The key contributions are: (1) a novel motion prediction module that disentangles and models dynamic priors, driven by explicit motion features; and (2) a joint training strategy combining tuple-level generation with continuity regularization to ensure temporal coherence and contextual consistency. Evaluated on the KTH and BAIR benchmarks, the method achieves significant improvements over state-of-the-art approaches—up to +1.2 dB in PSNR and +0.03 in SSIM for distant future frames—effectively mitigating error propagation while preserving both visual fidelity and temporal stability.
📝 Abstract
Existing long-term video prediction methods often rely on an autoregressive video prediction mechanism. However, this approach suffers from error propagation, particularly in distant future frames. To address this limitation, this paper proposes the first AutoRegression-Free (ARFree) video prediction framework using diffusion models. Unlike autoregressive video prediction mechanisms, ARFree directly predicts any future frame tuple from the context frame tuple. The proposed ARFree consists of two key components: 1) a motion prediction module that predicts future motion using motion features extracted from the context frame tuple; and 2) a training method that improves motion continuity and contextual consistency between adjacent future frame tuples. Our experiments on the KTH and BAIR benchmark datasets show that the proposed ARFree video prediction framework outperforms several state-of-the-art video prediction methods.
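The key distinction between the two prediction mechanisms can be illustrated with a toy sketch (not the paper's diffusion model): an autoregressive predictor feeds each output back as input, so a small per-step bias compounds over the horizon, while an AR-free predictor maps the context tuple directly to each future frame, so the error stays bounded. The functions `step_fn` and `tuple_fn` below are hypothetical one-step and k-steps-ahead predictors for a trivial scalar "video" whose true frame at time t is t.

```python
def autoregressive_predict(context, n_future, step_fn):
    # Frame-by-frame rollout: each prediction is fed back as the next
    # input, so any systematic per-step error compounds over the horizon.
    frames = list(context)
    for _ in range(n_future):
        frames.append(step_fn(frames[-1]))
    return frames[len(context):]

def arfree_predict(context, offsets, tuple_fn):
    # AR-free mapping: every future frame is predicted directly from the
    # context tuple, so per-prediction error does not accumulate.
    return [tuple_fn(context, k) for k in offsets]

# Toy dynamics: the true frame at time t is simply the scalar t.
BIAS = 0.1  # small systematic model error per prediction

def step_fn(prev_frame):
    return prev_frame + 1.0 + BIAS  # biased one-step predictor

def tuple_fn(context, k):
    return context[-1] + k + BIAS  # biased direct k-steps-ahead predictor

context = [0.0, 1.0, 2.0]                  # frames at t = 0, 1, 2
horizon = 10
truth = [3.0 + k for k in range(horizon)]  # frames at t = 3 .. 12

ar = autoregressive_predict(context, horizon, step_fn)
arfree = arfree_predict(context, range(1, horizon + 1), tuple_fn)

ar_err = [abs(p - t) for p, t in zip(ar, truth)]
arfree_err = [abs(p - t) for p, t in zip(arfree, truth)]
# Autoregressive error grows linearly with the horizon (0.1 * k),
# while the AR-free error stays flat at 0.1 for every future frame.
```

This mirrors the motivation stated in the abstract: error propagation is an artifact of the rollout loop itself, and removing the loop removes the compounding, independently of how accurate the per-prediction model is.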