🤖 AI Summary
Existing video world models struggle to achieve precise 4D geometric control over both camera and multi-object motion within a unified framework. This work proposes a 4D geometry-aware video generation model that jointly represents dynamic scenes using a static background point cloud and 3D Gaussian trajectories for moving objects. By introducing class-agnostic probabilistic 3D occupancy modeling, the method enables explicit and unified control over camera and multi-object motion. Furthermore, the authors develop an unsupervised 4D data engine that automatically extracts 4D supervision signals from unlabeled videos to guide a pretrained video diffusion model, enabling the synthesis of high-fidelity, view-consistent, and controllable videos. Evaluated on large-scale real-world data, the approach significantly improves geometric consistency and motion controllability in generated videos.
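To make the representation concrete, below is a minimal sketch of such a 4D world state: a static background point cloud plus per-object 3D Gaussian trajectories whose time-varying mean and covariance define a soft, class-agnostic occupancy. All names and the exact occupancy formula are hypothetical assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianTrajectory:
    """One moving object: a 3D Gaussian per frame (hypothetical structure)."""
    means: np.ndarray   # (T, 3) object center at each frame
    covs: np.ndarray    # (T, 3, 3) 3D covariance at each frame (spatial extent)

    def occupancy(self, xyz: np.ndarray, t: int) -> np.ndarray:
        """Unnormalized Gaussian occupancy of query points (N, 3) at frame t."""
        diff = xyz - self.means[t]                     # (N, 3)
        prec = np.linalg.inv(self.covs[t])             # (3, 3) precision matrix
        mahal = np.einsum("ni,ij,nj->n", diff, prec, diff)
        return np.exp(-0.5 * mahal)                    # in (0, 1], peaks at the mean

@dataclass
class WorldState4D:
    """Static background geometry plus one trajectory per moving object."""
    background: np.ndarray                 # (P, 3) static scene point cloud
    objects: list[GaussianTrajectory]

def scene_occupancy(state: WorldState4D, xyz: np.ndarray, t: int) -> np.ndarray:
    """Soft occupancy at query points: max over all object Gaussians.
    This gives a category-agnostic alternative to rigid 3D bounding boxes."""
    if not state.objects:
        return np.zeros(len(xyz))
    return np.max([obj.occupancy(xyz, t) for obj in state.objects], axis=0)
```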
📝 Abstract
Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, since videos inherently represent dynamics only in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. A further major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
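As a rough illustration of how such 4D controls could be rendered into per-frame conditioning signals, the sketch below projects the background points and each object's Gaussian center through a frame's camera and splats a soft occupancy map. The camera conventions, function names, and the screen-space splatting shortcut (a fixed 2D sigma instead of projecting the full 3D covariance) are assumptions for illustration, not the paper's specification; it reuses the hypothetical `WorldState4D` from the earlier sketch.

```python
import numpy as np

def project(xyz_world: np.ndarray, w2c: np.ndarray, K: np.ndarray) -> np.ndarray:
    """World points (N, 3) -> pixel coords (N, 2) via 4x4 extrinsics and 3x3 intrinsics."""
    xyz_h = np.concatenate([xyz_world, np.ones((len(xyz_world), 1))], axis=1)
    cam = (w2c @ xyz_h.T).T[:, :3]                        # points in camera frame
    uv = (K @ cam.T).T
    return uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)    # perspective divide

def render_condition(state, t, w2c, K, hw=(256, 256)) -> np.ndarray:
    """One conditioning map per frame: sparse background anchors plus object blobs."""
    H, W = hw
    cond = np.zeros((H, W), dtype=np.float32)
    # Splat background points as sparse geometry anchors.
    uv = np.round(project(state.background, w2c, K)).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    cond[uv[valid, 1], uv[valid, 0]] = 0.5
    # Splat each object's Gaussian as a soft occupancy blob around its projected center.
    ys, xs = np.mgrid[0:H, 0:W]
    for obj in state.objects:
        cu, cv = project(obj.means[t][None], w2c, K)[0]
        sigma = 8.0  # assumed screen-space spread; a real system would project the 3D covariance
        blob = np.exp(-((xs - cu) ** 2 + (ys - cv) ** 2) / (2 * sigma**2))
        cond = np.maximum(cond, blob.astype(np.float32))
    return cond  # to be consumed by the video diffusion model as a control signal
```

Rendering one such map per frame yields a video-shaped control tensor, which is the general form of conditioning a pretrained video diffusion model can be fine-tuned to follow.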