VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

📅 2026-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the tendency of existing world models to overemphasize photometric detail in video prediction, which often yields geometrically inconsistent structures. To mitigate this, the authors propose a world-modeling paradigm built on frozen VGGT features: the high-dimensional latent representations serve as the world state, and a lightweight autoregressive temporal flow transformer forecasts their future evolution, circumventing direct pixel-level generation entirely. To alleviate training instability and exposure bias, they introduce a z-prediction parameterization together with a two-stage latent flow-forcing curriculum. With only 0.43B trainable parameters, the method significantly outperforms the strongest baselines on KITTI, Cityscapes, and TartanAir while achieving a 3.6–5× speedup in depth forecasting.

📝 Abstract
World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
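The clean-target (z-prediction) parameterization described in the abstract can be illustrated with the standard linear-interpolant form of flow matching. The sketch below is a minimal numpy toy, not the paper's implementation: the network output `z_pred` is a hypothetical stand-in, and the point is only that predicting the clean latent `z1` and predicting the velocity `z1 - z0` are analytically related, so a z-prediction head still yields a velocity for integration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024                      # VGGT latent dimension cited in the abstract
z1 = rng.standard_normal(d)   # "clean" target latent (stand-in for a VGGT token)
z0 = rng.standard_normal(d)   # Gaussian noise sample
t = 0.7                       # interpolation time in [0, 1)

# Linear-interpolant flow matching: a noisy point on the path from z0 to z1.
zt = (1.0 - t) * z0 + t * z1

# Velocity parameterization: the network regresses v = z1 - z0 directly.
v_target = z1 - z0

# z-prediction parameterization: the network regresses z1 instead; the
# velocity is then recovered analytically from the prediction and (zt, t).
z_pred = z1 + 0.01 * rng.standard_normal(d)   # hypothetical network output
v_from_z = (z_pred - zt) / (1.0 - t)

# Sanity check: with an exact prediction, both parameterizations coincide.
v_exact = (z1 - zt) / (1.0 - t)
assert np.allclose(v_exact, v_target)
```

The paper's claim is that in this high-dimensional space the regression target `z1` has a much better signal-to-noise ratio than `z1 - z0`, which is why the velocity form collapses while the clean-target form trains stably.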
Problem

Research questions and friction points this paper is trying to address.

world models
geometric consistency
video prediction
3D scene forecasting
geometry-foundation-model
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry world model
frozen foundation model
autoregressive prediction
flow matching
latent curriculum learning
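The autoregressive-prediction and curriculum items above target exposure bias: at inference the model conditions on its own imperfect outputs, not ground truth, so errors compound along the rollout. The toy sketch below (numpy, with a hypothetical `model` standing in for the temporal flow transformer) only illustrates the mechanism; the injected noise plays the role of the partially denoised self-rollouts the two-stage curriculum conditions on.

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps = 8, 5   # tiny stand-ins for the 1024-d latents and rollout length

def model(context):
    # Hypothetical one-step predictor: returns the mean of the context
    # tokens (a stand-in for the temporal flow transformer).
    return context.mean(axis=0)

def rollout(z_init, n, noise_scale):
    """Autoregressive rollout: each prediction is appended to the context,
    so any per-step error (here, injected noise) feeds back into later
    steps -- the compounding exposure bias the curriculum targets."""
    context = [z_init]
    for _ in range(n):
        z_next = model(np.stack(context))
        # Curriculum idea: during training, condition on partially
        # denoised (noisy) self-predictions rather than clean targets.
        z_next = z_next + noise_scale * rng.standard_normal(d)
        context.append(z_next)
    return np.stack(context)

traj_clean = rollout(np.ones(d), steps, noise_scale=0.0)  # teacher-forced limit
traj_noisy = rollout(np.ones(d), steps, noise_scale=0.5)  # self-rollout limit
```

Training only in the `noise_scale=0.0` regime and deploying in the noisy regime is exactly the train/test mismatch the two-stage latent flow-forcing curriculum is designed to close.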