Controllable Video Generation with Provable Disentanglement

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Controllable video generation faces challenges in modeling fine-grained spatiotemporal relationships and disentangling identity from motion. This paper proposes a disentanglement framework grounded in latent variable identifiability theory, establishing— for the first time—theoretically provable spatiotemporal disentanglement. We introduce the “Minimal Change Principle” and “Sufficient Variation Property” to rigorously guarantee disentanglement validity. Methodologically, we design a temporal transition module that jointly optimizes latent space dimensionality minimization and temporal conditional independence constraints within a GAN architecture, enabling precise separation of static identity and dynamic motion. Evaluated on multiple benchmarks, our approach significantly improves video quality and controllability, supports independent editing of identity and action, and achieves state-of-the-art performance in both qualitative and quantitative metrics.

Technology Category

Application Category

📝 Abstract
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Controllable video generation
Disentangle video concepts
Independent control over motion and identity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles static and dynamic variables
Ensures component-wise identifiability
Minimizes latent dynamic variables dimensionality
🔎 Similar Papers
No similar papers found.
Y
Yifan Shen
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
P
Peiyuan Zhu
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Z
Zijian Li
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Shaoan Xie
Shaoan Xie
Carnegie Mellon University
Representation LearningGenerative ModelCausality
Zeyu Tang
Zeyu Tang
Postdoctoral Scholar, Stanford University
Trustworthy AICausalityComputational Justice
Namrata Deka
Namrata Deka
Carnegie Mellon University
Machine LearningRobustnessRepresentation Learning
Zongfang Liu
Zongfang Liu
Zhejiang University & Westlake University
Machine LearningRepresentation LearningAI for Science
G
Guangyi Chen
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Carnegie Mellon University, Pittsburg, US
K
Kun Zhang
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Carnegie Mellon University, Pittsburg, US