Interactive Video Generation via Domain Adaptation

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key challenges in interactive video generation (IVG): (1) imprecise user control over object motion trajectories, and (2) degraded perceptual quality in existing training-free methods caused by attention masking. We propose a fine-tuning-free optimization framework for diffusion models. Our core contributions are: (1) Mask Normalization, which mitigates the internal covariate shift induced by attention masks; and (2) a temporal intrinsic diffusion prior that bridges the initialization gap between randomly sampled initial noise and interactive conditions. By integrating domain adaptation, distribution matching, and temporal consistency modeling, our method preserves text-conditional generation capability while significantly improving trajectory controllability and visual fidelity. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art IVG baselines across multiple quantitative metrics, including FVD, LPIPS, and trajectory error, as well as perceptual evaluation scores.

📝 Abstract
Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address the initialization gap, where the randomly sampled initial noise does not align with IVG conditioning, by introducing a temporal intrinsic diffusion prior that enforces spatio-temporal consistency at each denoising step. Extensive qualitative and quantitative evaluations demonstrate that mask normalization and temporal intrinsic denoising improve both perceptual quality and trajectory control over existing state-of-the-art IVG techniques.
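The paper does not spell out the exact form of mask normalization, but the idea of "distribution matching" between masked and unmasked attention can be illustrated with a minimal sketch. Assuming a simple moment-matching choice (the function name, the per-row mean/variance alignment, and the renormalization are illustrative assumptions, not the paper's formulation):

```python
import torch

def masked_attention_with_mask_norm(q, k, v, mask, eps=1e-6):
    """Hypothetical sketch of mask-normalized attention.

    Standard trajectory-guided masking sets blocked logits to -inf,
    which changes the attention-weight distribution relative to what
    the pretrained model saw (the internal covariate shift described
    in the paper). Here we match the first and second moments of the
    masked attention weights to those of the unmasked weights before
    mixing the values -- one simple distribution-matching choice.
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale              # (..., Tq, Tk)
    probs_free = logits.softmax(dim=-1)                     # unmasked reference
    probs_masked = logits.masked_fill(~mask, float("-inf")).softmax(dim=-1)

    # Align per-row mean/std of the masked distribution with the
    # unmasked one (assumed form of "pre-normalization").
    mu_f = probs_free.mean(-1, keepdim=True)
    sd_f = probs_free.std(-1, keepdim=True)
    mu_m = probs_masked.mean(-1, keepdim=True)
    sd_m = probs_masked.std(-1, keepdim=True)
    probs = (probs_masked - mu_m) / (sd_m + eps) * sd_f + mu_f

    # Keep masked positions at zero and renormalize to a distribution.
    probs = probs.clamp_min(0) * mask
    probs = probs / probs.sum(-1, keepdim=True).clamp_min(eps)
    return probs @ v
```

In this sketch the masked positions still receive zero weight, so trajectory guidance is preserved, while the surviving weights are rescaled toward the statistics the pretrained attention layers expect.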
Problem

Research questions and friction points this paper is trying to address.

Enabling interactive control of object motion in video generation
Addressing perceptual quality degradation from attention masking
Aligning initial noise sampling with interactive conditioning constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask normalization mitigates the internal covariate shift induced by attention masks
Temporal intrinsic diffusion prior enforces spatio-temporal consistency at each denoising step
Domain-adaptation framing treats both failure modes as distribution shift problems
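The temporal prior's exact construction is not given here, but a common way to impose spatio-temporal consistency on the initial noise of a video diffusion model is to mix a shared base-noise component into every frame. The sketch below is an assumed illustration of that general idea (the function name, the mixing weight `alpha`, and the variance-preserving normalization are all hypothetical, not the paper's method):

```python
import torch

def temporally_correlated_noise(num_frames, shape, alpha=0.5, generator=None):
    """Hypothetical sketch of temporally correlated initial noise.

    Each frame's noise is a mix of a shared base component and an
    independent component, so frames are correlated (encouraging
    temporal consistency) while each frame stays ~N(0, 1) thanks to
    the variance-preserving normalization.
    """
    base = torch.randn(shape, generator=generator)
    norm = (alpha**2 + (1 - alpha) ** 2) ** 0.5
    frames = []
    for _ in range(num_frames):
        indep = torch.randn(shape, generator=generator)
        frames.append((alpha * base + (1 - alpha) * indep) / norm)
    return torch.stack(frames)
```

With `alpha=0.5`, adjacent frames share half of their noise energy, which correlates them without changing the marginal distribution each denoising step expects.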