Enabling Visual Composition and Animation in Unsupervised Video Generation

📅 2024-03-21
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses controllable compositional video generation from unlabeled video data, proposing the first unsupervised end-to-end framework. Methodologically, it introduces, for the first time in unsupervised video generation, local self-supervised feature subsets derived from pretrained vision models (e.g., DINO) as fine-grained conditional controls, enabling disentangled modeling of object parts and editable motion dynamics. By integrating stochastic local conditioning with structured latent-space guidance, the framework supports on-the-fly composition of predefined parts at inference, yielding physically plausible dynamic videos. Extensive experiments across multiple benchmarks show that the approach significantly outperforms existing unsupervised methods in part-level controllability, video photorealism, and cross-scene generalization.

📝 Abstract
In this work we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference our model is capable of both composing scenes of predefined object parts and animating them in a plausible and controlled way. This is achieved by conditioning video generation on a randomly selected subset of local pre-trained self-supervised features during training. We call our model CAGE for visual Composition and Animation for video GEneration. We conduct a series of experiments to demonstrate the capabilities of CAGE in various settings. Project website: https://araachie.github.io/cage.
Problem

Research questions and friction points this paper is trying to address.

Unsupervised controllable video generation without annotations
Composing scenes and animating objects spatially and temporally
Learning scene compositionality and object dynamics via self-supervised features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised training on unannotated videos
Unified control format with self-supervised features
Spatiotemporal inpainting for scene compositionality
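The control format described above boils down to keeping a random sparse subset of local pretrained features (e.g., per-patch DINO embeddings) as the conditioning signal during training. A minimal sketch of that sampling step is below; the array shapes, the `keep_frac` parameter, and the function name are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def sample_conditioning(features, keep_frac=0.1, rng=None):
    """Keep a random subset of local patch features as sparse controls.

    features: (num_patches, dim) array of local self-supervised
    embeddings, e.g. one row per ViT patch (hypothetical shape).
    Returns the kept features and their (sorted) patch indices; the
    indices preserve the spatial location of each kept feature.
    """
    rng = np.random.default_rng(rng)
    n = features.shape[0]
    k = max(1, int(round(keep_frac * n)))
    # Sort so kept features stay in raster (patch-grid) order.
    idx = np.sort(rng.choice(n, size=k, replace=False))
    return features[idx], idx

# Toy example: a 14x14 patch grid with 384-dim features
# (DINO ViT-S/16-like dimensions, used here only for illustration).
feats = np.random.default_rng(0).normal(size=(14 * 14, 384))
kept, idx = sample_conditioning(feats, keep_frac=0.1, rng=0)
```

At inference, the same sparse-token format lets a user place features extracted from chosen object parts at chosen locations, which is what enables composition without any annotations.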
Aram Davtyan
Computer Vision Group, Institute of Informatics, University of Bern, Switzerland
Sepehr Sameni
Computer Vision Group, Institute of Informatics, University of Bern, Switzerland
Bjorn Ommer
CompVis @ LMU Munich and MCML, Germany
Paolo Favaro
Professor of Computer Vision, University of Bern
computer vision, machine learning, computational photography, inverse problems, optimization methods