VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces a new paradigm for arbitrary spatiotemporal video completion, enabling users to paste image patches at any spatial location and temporal timestamp within a video. It unifies diverse tasks, including image-to-video generation, inpainting, outpainting, and frame interpolation, under a single framework. The core challenge is the temporal ambiguity of latent representations induced by causal VAEs, which impedes precise frame-level control. To address this, the authors propose a hybrid conditioning strategy: zero-padding for spatial placement combined with time-aware RoPE interpolation for temporal alignment, enabling fine-grained spatiotemporal control without introducing new parameters or finetuning the frozen latent video diffusion model. Built on the In-Context Conditioning (ICC) paradigm, the method achieves state-of-the-art performance on the newly constructed benchmark VideoCanvasBench, demonstrating both high-fidelity intra-scene reconstruction and strong cross-scene generalization.
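The spatial half of the hybrid strategy can be illustrated with a minimal sketch: a user patch is pasted into an all-zero canvas at its target location, and a binary mask records which pixels are conditioned. The function name `place_patch` and the shapes used here are illustrative assumptions, not the paper's actual implementation.

```python
def place_patch(canvas_h, canvas_w, patch, top, left):
    """Embed a patch (list of rows) in a zero canvas; return (canvas, mask).

    canvas: zeros everywhere except where the patch is pasted.
    mask:   1.0 on conditioned pixels, 0.0 elsewhere.
    """
    ph, pw = len(patch), len(patch[0])
    canvas = [[0.0] * canvas_w for _ in range(canvas_h)]
    mask = [[0.0] * canvas_w for _ in range(canvas_h)]
    for i in range(ph):
        for j in range(pw):
            canvas[top + i][left + j] = patch[i][j]
            mask[top + i][left + j] = 1.0
    return canvas, mask
```

In the actual model the canvas and mask would be per-frame tensors fed to the frozen backbone; zero-padding lets the model distinguish conditioned from unconditioned regions without any new parameters.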

📝 Abstract
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
Problem

Research questions and friction points this paper is trying to address.

Generating videos from arbitrary spatiotemporal patches
Resolving temporal ambiguity in latent video diffusion models
Unifying multiple controllable video generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Conditioning for video completion
Hybrid strategy decouples spatial and temporal control
Temporal RoPE resolves VAE temporal ambiguity
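The Temporal RoPE idea can be sketched as follows. A causal VAE with temporal stride r compresses r pixel frames into one latent, so several pixel frames share a single integer latent position; assigning a condition the continuous position t / r disambiguates which pixel frame it targets. The function names and the stride value below are illustrative assumptions, not the paper's code.

```python
import math

def fractional_latent_position(t, stride=4):
    """Map pixel-frame index t to a continuous position in the latent sequence.

    With stride 4, frames 4 and 6 land in the same latent slot but receive
    distinct fractional positions 1.0 and 1.5, resolving the ambiguity.
    """
    return t / stride

def rope_angles(position, dim=8, base=10000.0):
    """Standard RoPE rotation angles evaluated at a (possibly fractional) position."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]
```

Because RoPE is a continuous function of position, evaluating it at fractional indices requires no new parameters or finetuning, which is what lets the frozen backbone achieve pixel-frame-aware temporal alignment.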