🤖 AI Summary
This work addresses the quality degradation in long-horizon generation with causal autoregressive video diffusion models, which arises from a mismatch between the history distributions seen during training and those arising at inference. To resolve this, the authors propose the RAVEN framework, which repacks each autoregressive self-rollout during training into a sequence that interleaves clean historical endpoints with noisy denoising states, thereby aligning the attention pattern used in training with the extrapolation behavior required at inference. They further introduce CM-GRPO, which treats a consistency sampling step as a conditional Gaussian transition and applies online group relative policy optimization directly to this kernel, without the Euler-Maruyama auxiliary process used in prior flow-model RL formulations. Experiments show that RAVEN outperforms recent causal video distillation baselines on quality, semantic, and dynamic degree evaluations, and that CM-GRPO yields further gains when combined with RAVEN.
📝 Abstract
Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
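To make the CM-GRPO idea concrete, the sketch below illustrates how a consistency sampling step can be read as a conditional Gaussian transition whose log-density feeds a group-relative, clipped policy objective. It is only an illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the flow-style re-noising rule `x_s = (1 - s) * x0_pred + s * eps`, the PPO-style clipping, and all function names (`consistency_step`, `gaussian_log_prob`, `group_relative_advantages`, `clipped_surrogate`) are hypothetical choices made here.

```python
import math
import random


def consistency_step(x0_pred, s, rng):
    """Re-noise a predicted clean sample to noise level s in [0, 1].

    ASSUMPTION: flow-style interpolation x_s = (1 - s) * x0_pred + s * eps
    with eps ~ N(0, I). Under it, the next state is Gaussian with
    mean (1 - s) * x0_pred and standard deviation s.
    """
    mean = [(1.0 - s) * v for v in x0_pred]
    sample = [m + s * rng.gauss(0.0, 1.0) for m in mean]
    return sample, mean


def gaussian_log_prob(x, mean, std):
    """Log-density of the isotropic Gaussian N(mean, std^2 I) at x."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return (-0.5 * sq_dist / std ** 2
            - d * math.log(std)
            - 0.5 * d * math.log(2.0 * math.pi))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one transition (ASSUMED form)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    rng = random.Random(0)
    # One toy transition: the "policy" is the Gaussian kernel itself.
    x_next, mean = consistency_step([0.5, -0.5], s=0.5, rng=rng)
    logp = gaussian_log_prob(x_next, mean, std=0.5)
    # Hypothetical rewards for a group of four rollouts.
    advantages = group_relative_advantages([0.1, 0.4, 0.2, 0.9])
    print(clipped_surrogate(logp, logp, advantages[0]))
```

Because the transition density is available in closed form under this assumption, the likelihood ratio between the updated and the sampling policy can be computed directly from the sampler's own step, which is the property the abstract points to when it says no auxiliary process is needed.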