🤖 AI Summary
Existing general-purpose 3D foundation models predominantly rely on RGB inputs alone and neglect readily available geometric cues such as depth, camera intrinsics, and pose, which limits their spatial understanding. This work introduces OmniVGGT, a general-purpose 3D foundation model framework that accepts an arbitrary number of auxiliary geometric modalities during both training and inference. Its core contributions are: (1) GeoAdapter, a module built on zero-initialized convolutions that progressively injects geometric information without disrupting the foundation model's representation space; (2) a stochastic multimodal fusion strategy that samples a random subset of modalities per training instance, so that any combination of inputs can be used at test time; and (3) a scalable Transformer-based modality encoding architecture. OmniVGGT achieves state-of-the-art results on monocular and multi-view depth estimation, multi-view stereo, and camera pose estimation, even with RGB-only input, and yields consistent gains on robotic tasks when integrated into vision-language-action models.
📝 Abstract
General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrate OmniVGGT into vision-language-action (VLA) models. The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
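The two mechanisms described above, zero-initialized injection and per-instance modality subset sampling, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The names `sample_modality_subset`, `ZeroInitProjection`, and `fuse` are hypothetical, a per-token linear map stands in for a convolution, each modality is assumed to be kept independently with a fixed probability, and additive fusion is assumed. The abstract does not specify any of these details.

```python
import random

# Auxiliary cues named in the abstract; the exact grouping is an assumption.
MODALITIES = ("depth", "intrinsics", "extrinsics")


def sample_modality_subset(rng, keep_prob=0.5):
    """Independently keep each auxiliary modality with probability keep_prob.

    The empty subset (RGB-only) is a valid outcome, so training also
    covers instances with no auxiliary input at all.
    """
    return [m for m in MODALITIES if rng.random() < keep_prob]


class ZeroInitProjection:
    """Linear map (a stand-in for a 1x1 convolution) with zero weights and bias."""

    def __init__(self, in_dim, out_dim):
        self.weight = [[0.0] * in_dim for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


def fuse(rgb_token, aux_features, adapters, active):
    """Add the adapter output of every active modality to the RGB token."""
    fused = list(rgb_token)
    for name in active:
        delta = adapters[name](aux_features[name])
        fused = [f + d for f, d in zip(fused, delta)]
    return fused


# Usage: at initialization every adapter outputs zeros, so the fused token
# equals the RGB token regardless of which subset was sampled.
rng = random.Random(0)
adapters = {m: ZeroInitProjection(in_dim=2, out_dim=3) for m in MODALITIES}
aux = {m: [1.0, -2.0] for m in MODALITIES}
rgb_token = [0.5, 0.1, -0.3]
active = sample_modality_subset(rng)
assert fuse(rgb_token, aux, adapters, active) == rgb_token
```

Because every adapter starts at zero, the first training steps see the backbone's original features unchanged, and geometric information enters only as the adapter weights move away from zero. Sampling a different subset for each instance means the same weights are exercised under every input combination, including RGB-only.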