🤖 AI Summary
This work addresses the performance degradation and missing theoretical guarantees that arise when bidirectional video diffusion models are distilled into autoregressive (AR) ones, a problem rooted in the architectural mismatch between full and causal attention. The authors propose Causal Forcing, which uses an autoregressive teacher for the ordinary differential equation (ODE) based initialization of the student, bridging the gap between bidirectional and causal attention. The method gives the first theoretical resolution of the frame-level non-injectivity problem in autoregressive distillation: it ensures the flow map is invertible and avoids the performance loss caused by conditional-expectation solutions. Built on an ODE-based distillation framework that combines autoregressive diffusion with causal attention, Causal Forcing achieves state-of-the-art results, surpassing prior art by 19.3%, 8.7%, and 16.7% on the Dynamic Degree, VisionReward, and Instruction Following metrics, respectively, and enables high-quality, real-time interactive video generation.
📝 Abstract
To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which introduces an architectural gap when full attention is replaced by causal attention. Existing approaches, however, do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this, we propose Causal Forcing, which uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and code: https://thu-ml.github.io/CausalForcing.github.io/
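To make the ODE-distillation setup concrete, here is a minimal toy sketch (not the paper's implementation; all function names and the closed-form teacher drift are illustrative assumptions). A deterministic "teacher" PF-ODE is integrated from noise to a clean endpoint, producing (noise, clean) pairs; because the flow map is injective, a one-step "student" fit by least squares can recover it almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
MU = np.array([2.0, -1.0, 0.5, 3.0])  # toy data mean (assumed, illustrative)

def teacher_velocity(x, t):
    # Hypothetical closed-form drift toward the data mean; stands in for
    # the teacher model's PF-ODE velocity field.
    return MU - x

def solve_pf_ode(noise, steps=50):
    # Euler integration from t=0 (noise) to t=1 (clean endpoint).
    # The map noise -> endpoint is a deterministic, injective flow.
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * teacher_velocity(x, i * dt)
    return x

# Collect (noise, clean) pairs from the teacher's ODE trajectories.
noises = rng.normal(0.0, 1.0, size=(256, 4))
cleans = np.stack([solve_pf_ode(n) for n in noises])

# One-step student: affine least-squares map from noise to the ODE endpoint.
X = np.hstack([noises, np.ones((len(noises), 1))])  # add bias column
W, *_ = np.linalg.lstsq(X, cleans, rcond=None)
pred = X @ W
mse = np.mean((pred - cleans) ** 2)
print(mse)  # near machine precision: the injective flow is recoverable
```

If the teacher's trajectories instead depended on per-sample conditioning invisible to the student (the non-injective case the abstract describes), the same least-squares student would collapse to a conditional expectation over the possible endpoints, illustrating the degradation that Causal Forcing is designed to avoid.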