VGGT-$Ω$

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets accuracy, efficiency, and scalability in feed-forward reconstruction of both static and dynamic scenes. It introduces VGGT-$Ω$, which uses registers to aggregate scene information into a compact representation and adds register attention, restricting inter-frame information exchange to those registers and partly replacing global attention. The architecture is simplified to a single dense prediction head trained with multi-task supervision, and the expensive high-resolution convolutional layers are removed. Together with a data annotation pipeline that supports dynamic scenes and a self-supervised learning protocol, this lets the model train with about 30% of its predecessor's GPU memory, use 15x more supervised data than prior work, and exploit large amounts of unlabeled video. It reports strong results across multiple benchmarks, including a 77% improvement over the previous best camera estimation accuracy on Sintel, and shows that the learned registers can improve vision-language-action models and support alignment with language.

📝 Abstract

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
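The register attention idea described in the abstract can be illustrated with an attention mask. The sketch below is a minimal NumPy illustration, not the paper's implementation: the token layout (registers first within each frame), the exact mask pattern (patch tokens attend only within their own frame, registers additionally attend to registers of other frames), and the function names `register_attention_mask` and `masked_attention` are all assumptions made for this example.

```python
import numpy as np


def register_attention_mask(num_frames, patches_per_frame, registers_per_frame):
    """Return a boolean mask M where M[i, j] is True if token i may attend to token j.

    Assumed (hypothetical) layout: tokens are grouped by frame, and within each
    frame the register tokens come first, followed by the patch tokens.
    """
    tokens_per_frame = registers_per_frame + patches_per_frame
    # Frame index of every token in the flattened sequence.
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # True for register tokens, False for patch tokens.
    is_register = np.tile(
        np.arange(tokens_per_frame) < registers_per_frame, num_frames
    )
    same_frame = frame_id[:, None] == frame_id[None, :]
    both_registers = is_register[:, None] & is_register[None, :]
    # Cross-frame exchange is allowed only between register tokens (assumption).
    return same_frame | both_registers


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    mask = register_attention_mask(num_frames=2, patches_per_frame=3,
                                   registers_per_frame=1)
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(mask.shape[0], 4)) for _ in range(3))
    out = masked_attention(q, k, v, mask)
    print(mask.sum(), "allowed pairs out of", mask.size)
```

Under this assumed pattern, with F frames, R registers and P patch tokens per frame, the mask allows F(R+P)^2 + F(F-1)R^2 query-key pairs instead of the F^2(R+P)^2 pairs of full global attention, which is one way to see where memory savings of this kind could come from. The abstract states that register attention replaces global attention only in part, so a real model would presumably mix layers of both kinds.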

Problem

Research questions and friction points this paper is trying to address.

reconstruction

static and dynamic scenes

scalability

geometry-aware features

spatial understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

dense prediction head

self-supervised learning

geometry-aware reconstruction