Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-image-to-video (TI2V) generation suffers from limited flexibility in visual conditioning: most existing methods rely on costly fine-tuning and support only a few predefined conditioning configurations. This paper proposes a training-free, general-purpose TI2V framework, FlexTI2V, that injects an arbitrary number of image conditions at arbitrary temporal positions into a frozen T2V foundation model. Key contributions include: (1) latent-space inversion of condition images to align them with the model's noisy representations; (2) a random patch swapping strategy that fuses local image features into video latents to enhance spatiotemporal consistency; and (3) dynamic frame-level modulation of conditioning strength for improved controllability. The method significantly outperforms existing training-free approaches across multiple benchmarks, achieving superior video fidelity alongside fine-grained control, and ablation studies comprehensively validate the efficacy of each component.
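The inversion in (1) can be pictured with a standard DDIM-style inversion, which deterministically maps a clean condition-image latent into the noisy latent space that the T2V model denoises. This is a minimal sketch under assumed interfaces: `eps_model`, `text_emb`, and `alphas_cumprod` are hypothetical names, and the paper's actual inversion procedure may differ.

```python
import torch

@torch.no_grad()
def ddim_invert(x0_latent, eps_model, text_emb, alphas_cumprod, num_steps=50):
    """Run the DDIM update in reverse to map a clean image latent x_0 to a
    noisy latent x_T (a common inversion recipe). `eps_model` is an assumed
    noise predictor with signature eps(x_t, t, text_emb)."""
    T = alphas_cumprod.shape[0]
    steps = torch.linspace(0, T - 1, num_steps).long()
    x = x0_latent
    for i in range(num_steps - 1):
        t, t_next = steps[i], steps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t, text_emb)                          # predicted noise at t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # step toward noise
    return x  # noisy representation of the condition image
```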

📝 Abstract
Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also provide further insights into our method through a detailed ablation study and analysis.
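As a rough picture of the random patch swapping described above, one can imagine replacing, at each denoising step, a random subset of local patches in a frame's latent with the co-located patches of the inverted condition image. The block size, tensor layout, and sampling scheme below are assumptions for illustration, not the paper's specification.

```python
import torch

def random_patch_swap(frame_latent, cond_latent, swap_ratio, patch=4):
    """Swap a random fraction `swap_ratio` of non-overlapping patch x patch
    blocks of the video-frame latent with the matching blocks of the inverted
    condition-image latent. Assumes (C, H, W) latents with H and W divisible
    by `patch`; an illustrative sketch, not the paper's implementation."""
    C, H, W = frame_latent.shape
    gh, gw = H // patch, W // patch
    block_mask = torch.rand(gh, gw, device=frame_latent.device) < swap_ratio
    # Upsample the block-level mask to latent-cell resolution.
    mask = block_mask.repeat_interleave(patch, dim=0).repeat_interleave(patch, dim=1)
    return torch.where(mask.unsqueeze(0), cond_latent, frame_latent)
```

Raising `swap_ratio` pulls a frame toward the condition image (fidelity); lowering it leaves more of the model's own prediction intact (creativity), which is the trade-off the dynamic control mechanism modulates per frame.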
Problem

Research questions and friction points this paper is trying to address.

Enabling flexible visual conditioning in text-to-video models without training
Overcoming resource-intensive finetuning for image-conditioned video generation
Balancing creativity and fidelity in multi-image video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free image conditioning for T2V models
Random patch swapping for visual feature integration
Dynamic control mechanism balancing creativity and fidelity (see the sketch below)
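One plausible reading of this dynamic control is a per-frame, per-step swap strength that is highest for frames temporally close to a condition image and anneals as denoising proceeds. The schedule below (exponential distance decay, linear step annealing) and the parameters `base` and `decay` are illustrative assumptions, not the paper's formula.

```python
import math

def swap_strength(frame_idx, cond_positions, step, num_steps,
                  base=0.5, decay=0.3):
    """Illustrative conditioning-strength schedule: frames at conditioned
    positions are swapped aggressively, distant frames lightly, and all
    swapping fades over the denoising trajectory (assumed knobs)."""
    dist = min(abs(frame_idx - p) for p in cond_positions)  # nearest condition frame
    step_weight = 1.0 - step / num_steps                    # anneal over denoising
    return base * math.exp(-decay * dist) * step_weight
```

Under this reading, a conditioned frame (dist = 0) receives the full base swap ratio early in sampling and progressively more creative freedom later, while unconditioned frames are only lightly constrained throughout.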
Bolin Lai
Georgia Institute of Technology
Multimodal Learning · LLM · Image Generation · Video Generation
Sangmin Lee
Sungkyunkwan University
Xu Cao
University of Illinois Urbana-Champaign
Xiang Li
University of Illinois Urbana-Champaign
James M. Rehg
University of Illinois Urbana-Champaign