Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D geometric foundation models are constrained by the scarcity of large-scale, diverse annotated data. While internet videos offer abundant visual content, they lack accurate geometric labels and contain significant observational noise. To address this, this work proposes SAGE, a framework that, for the first time, enables scalable weakly supervised training of 3D geometric foundation models on in-the-wild videos. SAGE employs a hierarchical mining pipeline to extract training trajectories from videos, integrating sparse geometric anchoring (guided by Structure-from-Motion (SfM) point clouds for global structure) with dense differentiable multi-view consistency based on 3D Gaussian rendering. An anchor-data regularization strategy is introduced to mitigate catastrophic forgetting. Evaluated on unseen benchmarks including 7Scenes, TUM-RGBD, and Matterport3D, the model demonstrates substantially improved zero-shot generalization, reducing Chamfer Distance by 20–42% and establishing a new paradigm for universal 3D learning.

📝 Abstract
Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
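The hybrid supervision described in the abstract can be sketched as a weighted sum of three terms: a sparse anchoring loss against SfM point clouds, a dense photometric consistency loss against rendered views, and a regularization term that keeps the adapted model close to its pretrained anchor. The sketch below is a minimal illustration under my own assumptions; all function names and the weight values are hypothetical, not from the paper or its code release.

```python
# Hypothetical sketch of a SAGE-style hybrid weak-supervision objective.
# Function names, signatures, and weights are illustrative assumptions.
import numpy as np

def sparse_anchor_loss(pred_pts, sfm_pts):
    # Sparse geometric anchoring: mean distance between predicted 3D points
    # and SfM-triangulated anchor points at the pixels where SfM succeeded.
    return float(np.mean(np.linalg.norm(pred_pts - sfm_pts, axis=-1)))

def dense_consistency_loss(rendered, frame, mask):
    # Dense differentiable consistency: masked photometric error between a
    # rendering of the predicted geometry (e.g. via 3D Gaussians) and the
    # observed video frame.
    err = (rendered - frame) ** 2
    return float(np.sum(err * mask[..., None]) / max(mask.sum(), 1))

def anchor_regularization(params, pretrained_params):
    # Anchor-data regularization proxy: pull adapted weights toward the
    # pretrained model to mitigate catastrophic forgetting.
    return float(np.mean((params - pretrained_params) ** 2))

def hybrid_loss(pred_pts, sfm_pts, rendered, frame, mask,
                params, pretrained_params,
                w_sparse=1.0, w_dense=1.0, w_reg=0.1):
    # Combine the three supervision signals; the weights are placeholders.
    return (w_sparse * sparse_anchor_loss(pred_pts, sfm_pts)
            + w_dense * dense_consistency_loss(rendered, frame, mask)
            + w_reg * anchor_regularization(params, pretrained_params))
```

Each term vanishes when prediction and supervision agree exactly, so the total loss is zero only for a model that matches the SfM anchors, reproduces the video frames under rendering, and has not drifted from the pretrained weights.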
Problem

Research questions and friction points this paper is trying to address.

3D reconstruction
geometric foundation models
weak supervision
Internet video
scalable learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable Adaptation
Weak Supervision
Geometric Foundation Models
Internet Video
3D Gaussian Rendering