🤖 AI Summary
Existing video generation models are constrained by fixed input formats, limiting their ability to simultaneously support multi-granularity controllable generation—from 4D object trajectories or camera paths to coarse-grained text prompts—while balancing precise control in specified regions against diversity in unspecified ones. To address this, we propose a unified variational inference framework that employs annealed KL divergence optimization over a sequence of distributions and context-conditional factorization to avoid local optima and enable seamless integration of heterogeneous control signals. The framework is agnostic to backbone architectures and incorporates multiple state-of-the-art video generation models to enhance representational capacity. Experiments demonstrate significant improvements over prior work in control accuracy, generative diversity, and 3D spatiotemporal consistency. To our knowledge, this is the first approach achieving cross-granularity, high-fidelity, and robust controllable video synthesis.
📝 Abstract
Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.