🤖 AI Summary
This work addresses controllable compositional video generation from unlabeled video data, proposing the first unsupervised end-to-end framework for this task. Methodologically, it is the first unsupervised video-generation approach to use subsets of local self-supervised features from pretrained vision models (e.g., DINO) as fine-grained conditional controls, enabling disentangled modeling of object parts and editable motion dynamics. By combining stochastic local conditioning with structured latent-space guidance, the framework supports on-the-fly composition of predefined parts at inference, yielding physically plausible dynamic videos. Extensive experiments across multiple benchmarks show that the approach significantly outperforms existing unsupervised methods in part-level controllability, video photorealism, and cross-scene generalization.
📝 Abstract
In this work, we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference time our model can both compose scenes of predefined object parts and animate them in a plausible and controlled way. This is achieved by conditioning video generation on a randomly selected subset of local pre-trained self-supervised features during training. We call our model CAGE, for visual Composition and Animation for video GEneration. We conduct a series of experiments to demonstrate the capabilities of CAGE in various settings. Project website: https://araachie.github.io/cage.