🤖 AI Summary
This work introduces DINO-world, a general-purpose video world model built on the pre-trained DINOv2 image encoder, designed to model temporal dynamics across diverse domains (autonomous driving, indoor scenes, and simulated environments) and to predict future frames. Methodologically, it freezes DINOv2 as a fixed visual backbone and trains a lightweight future-prediction network on large-scale, uncurated video data, modeling temporal evolution directly in DINOv2 feature space; the framework also supports action-conditional fine-tuning and latent-space trajectory planning. The key contribution is the adaptation of a frozen self-supervised image encoder to video world modeling, yielding cross-domain generalization, strong intuitive-physics understanding, and annotation-free representation of scene dynamics. Experiments show consistent improvements over prior state-of-the-art methods on video prediction benchmarks, including segmentation and depth forecasting, and validate planning by simulating candidate trajectories in latent space.
📝 Abstract
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
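The core recipe above (freeze the encoder, train only a predictor on next-frame latents, then roll it out for planning) can be sketched in a few lines. This is a toy illustration, not the paper's implementation: a fixed random projection stands in for the frozen DINOv2 encoder, the predictor is a single linear map trained with latent-space MSE, and all shapes and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the frozen DINOv2 encoder: a fixed random projection from
# flattened "frames" (D_pix dims) to latent features (D_lat dims). It is never updated.
D_pix, D_lat = 64, 16
W_enc = rng.normal(0.0, 1.0 / np.sqrt(D_pix), (D_pix, D_lat))
encode = lambda x: x @ W_enc  # frozen: no gradient ever touches W_enc

# Synthetic "video": each frame is an orthogonal linear step of the previous one,
# so frame norms stay bounded and the latent dynamics are partly learnable.
A, _ = np.linalg.qr(rng.normal(size=(D_pix, D_pix)))
frames = [rng.normal(size=D_pix)]
for _ in range(199):
    frames.append(A @ frames[-1])
frames = np.stack(frames)

z = encode(frames)            # latents for all frames
z_t, z_next = z[:-1], z[1:]   # (current latent, next latent) training pairs

# Lightweight predictor trained on latent-space MSE with plain gradient descent.
W_pred = np.zeros((D_lat, D_lat))
lr = 0.01
for _ in range(500):
    err = z_t @ W_pred - z_next
    W_pred -= lr * z_t.T @ err / len(z_t)

mse = float(np.mean((z_t @ W_pred - z_next) ** 2))
print(f"latent prediction MSE: {mse:.4f}")

# Planning-style rollout: autoregressively apply the trained predictor to
# simulate a candidate future trajectory entirely in latent space.
z_sim = z[-1]
for _ in range(10):
    z_sim = z_sim @ W_pred
```

Because the encoder compresses 64 pixel dimensions into 16 latent dimensions, the linear predictor cannot fit the dynamics exactly; it only needs to beat a trivial baseline, which mirrors the idea that prediction happens in a lossy but semantically rich feature space rather than pixel space.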