🤖 AI Summary
This work addresses two limitations of existing autoregressive (AR) diffusion distillation methods, coarse response granularity and non-negligible sampling latency, which hinder real-time interactive video generation. The authors target frame-wise autoregression with only 1–2 sampling steps per frame and propose Causal Forcing++, which uses causal consistency distillation (causal CD) to initialize the few-step AR student. Causal CD learns the same AR-conditional flow map as causal ODE distillation but takes its supervision from a single online teacher ODE step between adjacent timesteps, so full PF-ODE trajectories do not need to be precomputed and stored; this makes initialization more efficient and easier to optimize. In the frame-wise 2-step setting, Causal Forcing++ surpasses the 4-step chunk-wise Causal Forcing baseline by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by roughly 4×.
📝 Abstract
Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by ~4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .
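
The core idea stated in the abstract, supervision from a single online teacher ODE step between adjacent timesteps rather than from stored full trajectories, can be illustrated with a toy NumPy sketch. This is not the authors' implementation. Every function name here is hypothetical, the "teacher" is a closed-form stand-in that assumes the clean next frame equals the previous frame, the interpolation `x_t = (1 - t) * x0 + t * noise` and the Euler step are assumptions, and no gradients are computed (in real training the target branch would be a stop-gradient or EMA copy of the student).

```python
import numpy as np

def teacher_velocity(x_t, t, prev_frame):
    # Toy stand-in for an AR teacher (assumption): the clean next frame equals
    # prev_frame, so with x_t = (1 - t) * x0 + t * noise the PF-ODE velocity
    # is (x_t - prev_frame) / t.
    return (x_t - prev_frame) / t

def teacher_ode_step(x_t, t, s, prev_frame):
    # One online Euler step from timestep t to the adjacent, less noisy s < t.
    return x_t + (s - t) * teacher_velocity(x_t, t, prev_frame)

def causal_cd_loss(student, target_student, x0, prev_frame, t, s, rng):
    # Noise the clean frame to timestep t, conditioned on the previous frame.
    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * noise
    # Supervision comes from a single teacher step, not a stored trajectory.
    x_s = teacher_ode_step(x_t, t, s, prev_frame)
    pred = student(x_t, t, prev_frame)
    # In real training this branch would carry no gradient.
    target = target_student(x_s, s, prev_frame)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
prev_frame = rng.standard_normal((4, 4))
x0 = prev_frame.copy()  # toy assumption: a static video

# For this toy teacher the exact flow map to t = 0 returns prev_frame.
exact_student = lambda x, t, ctx: ctx
# A student whose output drifts with t violates self-consistency.
biased_student = lambda x, t, ctx: ctx + 0.5 * t

loss_exact = causal_cd_loss(exact_student, exact_student, x0, prev_frame, 0.8, 0.6, rng)
loss_biased = causal_cd_loss(biased_student, biased_student, x0, prev_frame, 0.8, 0.6, rng)
print(loss_exact, loss_biased)
```

In this toy setting the exact flow map incurs zero loss, while the drifting student is penalized in proportion to the squared gap between the two adjacent timesteps. A full consistency objective would also enforce the boundary condition at `t = 0`, which the sketch omits.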