🤖 AI Summary
This work addresses the limitations of existing camera-controlled video-to-video (V2V) generation methods, which rely on non-causal full-sequence processing and prefix-style temporal concatenation. These choices require bidirectional attention, leading to high latency and quadratic computational complexity that hinder real-time streaming and interactive camera control. To overcome these challenges, the authors propose RealCam, an autoregressive framework for real-time, camera-controlled V2V generation. RealCam first trains a teacher model with a Cross-frame In-context Learning paradigm that interleaves source and target frames into synchronized pairs, removing the prefix bottleneck. The teacher is then distilled into a few-step causal student via Self-Forcing with Distribution Matching Distillation for efficient streaming inference. Loop-Closed Data Augmentation (LoopAug) is added to reduce inconsistency on closed-loop camera trajectories. The authors report state-of-the-art visual fidelity and temporal consistency, with inference that is orders of magnitude faster than existing approaches.
📝 Abstract
Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-causal, full-sequence processing and rigid prefix-style temporal concatenation. This architectural paradigm mandates bidirectional attention, resulting in prohibitive computational latency, quadratic complexity scaling, and inherent incompatibility with real-time streaming or variable-length inputs. To overcome these limitations, we introduce RealCam, a novel autoregressive framework for interactive, real-time camera-controlled V2V generation. We first design a high-fidelity teacher model grounded in a Cross-frame In-context Learning paradigm. By interleaving source and target frames into synchronized contextual pairs, our design inherently enables length-agnostic generalization and naturally facilitates causal adaptation, breaking the rigid prefix bottleneck. We then distill this teacher into a few-step causal student via Self-Forcing with Distribution Matching Distillation, enabling efficient, on-the-fly streaming synthesis. Furthermore, to mitigate severe loop inconsistency in closed-loop trajectories, we propose Loop-Closed Data Augmentation (LoopAug), a novel paradigm that synthesizes globally consistent loop sequences from existing multiview datasets. Extensive experiments demonstrate that RealCam achieves state-of-the-art visual fidelity and temporal consistency while enabling truly interactive camera control with orders-of-magnitude faster inference than existing paradigms. Our project page is at https://xyc-fly.github.io/RealCam/.
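To make the contrast between prefix-style concatenation and interleaved source/target pairs concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple rule in which a token may attend to any token whose frame index is not later than its own; the abstract does not specify the actual attention pattern. The function names `prefix_order`, `interleaved_order`, and `block_causal_mask` are hypothetical and introduced only for illustration.

```python
import numpy as np


def prefix_order(num_frames):
    """Prefix-style layout: all source frames first, then all target frames."""
    sources = [("src", t) for t in range(num_frames)]
    targets = [("tgt", t) for t in range(num_frames)]
    return sources + targets


def interleaved_order(num_frames):
    """Interleaved layout: each source frame is followed by its target frame."""
    order = []
    for t in range(num_frames):
        order.append(("src", t))
        order.append(("tgt", t))
    return order


def block_causal_mask(order, tokens_per_frame=1):
    """Boolean mask; entry [i, j] is True if token i may attend to token j.

    Assumed rule (illustrative only): a token may attend to every token
    whose frame index is less than or equal to its own frame index.
    """
    times = np.repeat([t for _, t in order], tokens_per_frame)
    return times[:, None] >= times[None, :]


if __name__ == "__main__":
    inter = block_causal_mask(interleaved_order(3))
    prefix = block_causal_mask(prefix_order(3))
    # Interleaved: nothing is attended beyond the partner token in the same pair.
    print("interleaved looks ahead past its pair:", bool(np.triu(inter, k=2).any()))
    # Prefix: source frame 0 must attend to target frame 0, three positions ahead.
    print("prefix looks ahead past its pair:", bool(np.triu(prefix, k=2).any()))
```

Under this assumed rule, the interleaved layout yields a mask that is lower triangular at the granularity of (source, target) pairs, so each new pair depends only on pairs that have already arrived and the sequence can grow one pair at a time. The prefix layout, by contrast, places each target frame far from its source frame in the sequence, which is one way to picture why it fits poorly with streaming and variable-length inputs.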