Controllable Video Synthesis via Variational Inference

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

Existing video generation models are constrained by fixed input formats, limiting their ability to simultaneously support multi-granularity controllable generation—from 4D object trajectories or camera paths to coarse-grained text prompts—while balancing precise control in specified regions against diversity in unspecified ones. To address this, we propose a unified variational inference framework that employs annealed KL divergence optimization over a sequence of distributions and context-conditional factorization to avoid local optima and enable seamless integration of heterogeneous control signals. The framework is agnostic to backbone architectures and incorporates multiple state-of-the-art video generation models to enhance representational capacity. Experiments demonstrate significant improvements over prior work in control accuracy, generative diversity, and 3D spatiotemporal consistency. To our knowledge, this is the first approach achieving cross-granularity, high-fidelity, and robust controllable video synthesis.

Technology Category

Application Category

📝 Abstract

Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

Problem

Research questions and friction points this paper is trying to address.

Developing controllable video synthesis with variational inference methods

Addressing optimization challenges via KL divergence minimization techniques

Generating videos with improved controllability and 3D consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational inference for video synthesis control

Step-wise KL divergence minimization optimization

Context-conditioned factorization reduces solution modes

🔎 Similar Papers

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance