🤖 AI Summary
This work introduces DINO-world, a general-purpose video world model built on the pre-trained DINOv2 image encoder, designed to model temporal dynamics across diverse domains (autonomous driving, indoor scenes, and simulated environments) and to predict future frames. Methodologically, it freezes DINOv2 as a fixed visual backbone and trains a lightweight future-prediction network on large-scale, uncurated video data, modeling temporal evolution directly in DINOv2 feature space; the framework also supports action-conditional fine-tuning and latent-space trajectory planning. The key contribution is the adaptation of a frozen self-supervised image encoder to video world modeling, yielding cross-domain generalization, strong intuitive-physics understanding, and annotation-free representation of scene dynamics. Experiments show consistent improvements over prior state-of-the-art methods on video prediction benchmarks, including segmentation and depth forecasting, and validate planning by simulating candidate trajectories in latent space.
📝 Abstract
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
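The core recipe above (freeze the encoder, train only a predictor on next-frame latents, then roll it out for planning) can be sketched in a few lines. This is a toy illustration, not the paper's implementation: a fixed random projection stands in for the frozen DINOv2 encoder, the predictor is a single linear map trained with latent-space MSE, and all shapes and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the frozen DINOv2 encoder: a fixed random projection from
# flattened "frames" (D_pix dims) to latent features (D_lat dims). It is never updated.
D_pix, D_lat = 64, 16
W_enc = rng.normal(0.0, 1.0 / np.sqrt(D_pix), (D_pix, D_lat))
encode = lambda x: x @ W_enc  # frozen: no gradient ever touches W_enc

# Synthetic "video": each frame is an orthogonal linear step of the previous one,
# so frame norms stay bounded and the latent dynamics are partly learnable.
A, _ = np.linalg.qr(rng.normal(size=(D_pix, D_pix)))
frames = [rng.normal(size=D_pix)]
for _ in range(199):
    frames.append(A @ frames[-1])
frames = np.stack(frames)

z = encode(frames)            # latents for all frames
z_t, z_next = z[:-1], z[1:]   # (current latent, next latent) training pairs

# Lightweight predictor trained on latent-space MSE with plain gradient descent.
W_pred = np.zeros((D_lat, D_lat))
lr = 0.01
for _ in range(500):
    err = z_t @ W_pred - z_next
    W_pred -= lr * z_t.T @ err / len(z_t)

mse = float(np.mean((z_t @ W_pred - z_next) ** 2))
print(f"latent prediction MSE: {mse:.4f}")

# Planning-style rollout: autoregressively apply the trained predictor to
# simulate a candidate future trajectory entirely in latent space.
z_sim = z[-1]
for _ in range(10):
    z_sim = z_sim @ W_pred
```

Because the encoder compresses 64 pixel dimensions into 16 latent dimensions, the linear predictor cannot fit the dynamics exactly; it only needs to beat a trivial baseline, which mirrors the idea that prediction happens in a lossy but semantically rich feature space rather than pixel space.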