🤖 AI Summary
Existing 3D geometric foundation models are constrained by the scarcity of large-scale, diverse annotated data. While internet videos offer abundant visual content, they lack accurate geometric labels and contain significant observational noise. To address this, this work proposes SAGE, a framework that, for the first time, enables scalable weakly supervised training of 3D geometric foundation models on in-the-wild videos. SAGE employs a hierarchical mining pipeline to extract training trajectories from videos, integrating sparse geometric anchoring—guided by Structure-from-Motion (SfM) point clouds for global structure—with dense differentiable multi-view consistency based on 3D Gaussian rendering. An anchor-data regularization strategy is introduced to mitigate catastrophic forgetting. Evaluated on unseen benchmarks including 7Scenes, TUM-RGBD, and Matterport3D, the model demonstrates substantially improved zero-shot generalization, reducing Chamfer Distance by 20–42% and establishing a new paradigm for universal 3D learning.
📝 Abstract
Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) informative training-trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20–42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
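The reported gains are measured in Chamfer Distance, the standard point-cloud reconstruction metric. As a minimal sketch (the paper's exact evaluation protocol, e.g. alignment and scaling before comparison, may differ), the symmetric Chamfer Distance averages nearest-neighbor distances in both directions between a predicted and a ground-truth point cloud:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two (N, 3) point clouds:
    mean nearest-neighbor distance from pred to gt, plus gt to pred.
    Illustrative only; benchmarks may use squared distances or other variants."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)    # nearest GT point for each prediction
    d_gt_to_pred, _ = cKDTree(pred).query(gt)    # nearest prediction for each GT point
    return float(d_pred_to_gt.mean() + d_gt_to_pred.mean())

# Identical clouds yield zero distance; displaced clouds yield a positive score.
cloud = np.random.rand(1000, 3)
print(chamfer_distance(cloud, cloud))        # → 0.0
print(chamfer_distance(cloud, cloud + 0.1) > 0.0)  # → True
```

A lower score indicates geometry closer to the ground truth, so a 20–42% reduction corresponds to proportionally smaller average point-to-surface error on the held-out scenes.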