🤖 AI Summary
Existing autoregressive video generation models are trained mainly for short-horizon continuation, so they treat long sequences as extended single shots and suffer from motion stagnation and semantic drift during long rollouts. This work proposes CausalCine, an interactive autoregressive framework for coherent multi-shot video synthesis. CausalCine models generation as an online directing process: it generates causally across shot boundaries, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. A causal base model is first trained on native multi-shot sequences so that it learns shot transitions before acceleration. Content-Aware Memory Routing (CAMR) then retrieves historical key-value (KV) cache entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, the base model is distilled into a few-step generator for real-time interactive generation. According to the reported experiments, CausalCine significantly outperforms existing autoregressive baselines and approaches the capability of bidirectional models, while keeping the streaming, causal nature of autoregressive generation.
📝 Abstract
Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
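To make the routing idea in the abstract concrete, the following is a minimal sketch of selecting cached KV blocks by content relevance instead of recency. It is not the authors' implementation: the abstract does not specify how relevance scores are computed, so the scoring rule here (dot product between the mean query and each block's mean key), the block granularity, and all names (`block_relevance`, `route_by_content`, `route_by_recency`, `budget`) are assumptions made for illustration only.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def block_relevance(queries: Sequence[Vector], block_keys: Sequence[Vector]) -> float:
    # ASSUMPTION: relevance = similarity of the mean query to the block's
    # mean key. The paper only says scores are "attention-based".
    return dot(mean_vector(queries), mean_vector(block_keys))


def route_by_content(
    queries: Sequence[Vector],
    cached_key_blocks: Sequence[Sequence[Vector]],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` most relevant cached blocks.

    Indices are returned in temporal order so the selected entries can be
    concatenated into the active memory without reordering history.
    """
    scores = [block_relevance(queries, keys) for keys in cached_key_blocks]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def route_by_recency(num_blocks: int, budget: int) -> List[int]:
    """Sliding-window baseline: keep only the most recent `budget` blocks."""
    return list(range(max(0, num_blocks - budget), num_blocks))


# Toy history: block 0 belongs to "shot A", blocks 1-2 to "shot B",
# block 3 to "shot C". The current queries resemble shot A again.
cached_key_blocks = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
    [[0.0, 1.0], [0.0, 0.8]],
    [[-1.0, 0.0], [-0.8, 0.2]],
]
queries = [[1.0, 0.0], [0.8, 0.2]]

content_selection = route_by_content(queries, cached_key_blocks, budget=2)
recency_selection = route_by_recency(len(cached_key_blocks), budget=2)
```

In this toy setup the content-based router keeps the early shot-A block that a recency window would have evicted, which is the behavior the abstract attributes to routing by relevance rather than temporal proximity. Both strategies keep the same number of blocks, so the active memory stays bounded either way.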