🤖 AI Summary
Autoregressive video diffusion models suffer from exposure bias due to the train-inference mismatch, which degrades temporal consistency in long-video generation. This paper proposes an end-to-end, teacher-free training framework to address the issue. Its core contributions are threefold: (1) a self-resampling mechanism that actively simulates erroneous inference trajectories during training to mitigate exposure bias; (2) a sparse causal mask combined with parameter-free history routing to strengthen inter-frame dependency modeling; and (3) joint optimization of frame-level diffusion losses, enabling native long-video synthesis without length constraints. Experiments show that the method achieves temporal coherence and generation stability clearly superior to knowledge-distillation baselines while maintaining comparable overall visual quality, and it excels particularly in minute-scale video synthesis.
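The self-resampling idea can be illustrated with a toy sketch: rather than conditioning training on clean ground-truth history frames, each history frame is partially noised and then reconstructed by the model's own denoiser, so training sees the same imperfect histories that accumulate at inference time. The function names and the toy denoiser below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def self_resample(history, denoise_fn, noise_level, rng):
    """Self-resampling sketch (illustrative; names are assumptions).

    Noise each history frame, then reconstruct it with the model's
    own denoiser. The reconstructions carry model error, mimicking
    the degraded histories seen during autoregressive inference.
    """
    noised = history + noise_level * rng.normal(size=history.shape)
    return denoise_fn(noised, noise_level)  # model's own reconstruction

def toy_denoiser(x, sigma):
    # Stand-in for a learned denoiser: shrink toward the prior mean.
    return x / (1.0 + sigma ** 2)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))  # 4 toy history "frames"
degraded = self_resample(frames, toy_denoiser, 0.5, rng)
# `degraded` differs from the ground-truth frames, so the frame-level
# diffusion loss is computed against realistic, error-bearing histories.
```

Conditioning on `degraded` rather than `frames` during training is what closes the train-inference gap in this sketch.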
📝 Abstract
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
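The history-routing mechanism described above can be sketched as a parameter-free top-k selection: score each history frame against the current query with a plain scaled dot product (no learned weights) and keep only the k most relevant frames for attention. The pooled per-frame features and function name below are assumptions for illustration.

```python
import numpy as np

def route_history(query, history, k):
    """Parameter-free history routing sketch (illustrative).

    query:   (d,) pooled feature for the frame being generated
    history: (T, d) pooled features, one per history frame
    Returns the indices of the k most relevant history frames,
    scored by scaled dot-product relevance with no learned parameters.
    """
    d = query.shape[-1]
    scores = history @ query / np.sqrt(d)  # (T,) relevance per frame
    topk = np.argsort(scores)[::-1][:k]    # k highest-scoring frames
    return np.sort(topk)                   # restore temporal order

rng = np.random.default_rng(0)
hist = rng.normal(size=(8, 64))            # 8 history frames, dim 64
q = hist[5] + 0.01 * rng.normal(size=64)   # query nearly matches frame 5
idx = route_history(q, hist, k=3)          # frame 5 should be retrieved
```

Restricting attention to the routed frames keeps per-step cost bounded as the history grows, which is what makes long-horizon generation efficient in this scheme.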