🤖 AI Summary
Autoregressive video diffusion models suffer from exposure bias due to the train-inference mismatch, which degrades temporal consistency in long-video generation. This paper proposes an end-to-end, teacher-free training framework to address the issue. Its core contributions are threefold: (1) a self-resampling mechanism that actively simulates erroneous inference trajectories during training to mitigate exposure bias; (2) a sparse causal mask combined with parameter-free history routing to strengthen inter-frame dependency modeling; and (3) joint optimization of frame-level diffusion losses, enabling native long-video synthesis without length constraints. Experiments show that the method achieves temporal coherence and generation stability clearly superior to knowledge-distillation baselines while maintaining comparable overall visual quality, and it excels particularly in minute-scale video synthesis.
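The self-resampling idea can be illustrated with a toy sketch: rather than conditioning training on clean ground-truth history frames, each history frame is partially noised and then reconstructed by the model's own denoiser, so training sees the same imperfect histories that accumulate at inference time. The function names and the toy denoiser below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def self_resample(history, denoise_fn, noise_level, rng):
    """Self-resampling sketch (illustrative; names are assumptions).

    Noise each history frame, then reconstruct it with the model's
    own denoiser. The reconstructions carry model error, mimicking
    the degraded histories seen during autoregressive inference.
    """
    noised = history + noise_level * rng.normal(size=history.shape)
    return denoise_fn(noised, noise_level)  # model's own reconstruction

def toy_denoiser(x, sigma):
    # Stand-in for a learned denoiser: shrink toward the prior mean.
    return x / (1.0 + sigma ** 2)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))  # 4 toy history "frames"
degraded = self_resample(frames, toy_denoiser, 0.5, rng)
# `degraded` differs from the ground-truth frames, so the frame-level
# diffusion loss is computed against realistic, error-bearing histories.
```

Conditioning on `degraded` rather than `frames` during training is what closes the train-inference gap in this sketch.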
📝 Abstract
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
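The history-routing mechanism described above can be sketched as a parameter-free top-k selection: score each history frame against the current query with a plain scaled dot product (no learned weights) and keep only the k most relevant frames for attention. The pooled per-frame features and function name below are assumptions for illustration.

```python
import numpy as np

def route_history(query, history, k):
    """Parameter-free history routing sketch (illustrative).

    query:   (d,) pooled feature for the frame being generated
    history: (T, d) pooled features, one per history frame
    Returns the indices of the k most relevant history frames,
    scored by scaled dot-product relevance with no learned parameters.
    """
    d = query.shape[-1]
    scores = history @ query / np.sqrt(d)  # (T,) relevance per frame
    topk = np.argsort(scores)[::-1][:k]    # k highest-scoring frames
    return np.sort(topk)                   # restore temporal order

rng = np.random.default_rng(0)
hist = rng.normal(size=(8, 64))            # 8 history frames, dim 64
q = hist[5] + 0.01 * rng.normal(size=64)   # query nearly matches frame 5
idx = route_history(q, hist, k=3)          # frame 5 should be retrieved
```

Restricting attention to the routed frames keeps per-step cost bounded as the history grows, which is what makes long-horizon generation efficient in this scheme.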