Lifting Unlabeled Internet-level Data for 3D Scene Understanding

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity and high cost of annotated 3D scene data by proposing an automated data engine that leverages the vast, untapped resource of unlabeled internet videos to generate multi-granularity 3D training data. The approach integrates 3D reconstruction, vision–language alignment, and end-to-end training to enable joint learning from both human-annotated and synthetically generated data. It identifies key bottlenecks in unsupervised 3D data generation and demonstrates, for the first time, the effectiveness of web-scale videos across a broad spectrum of tasks—ranging from low-level perception (e.g., 3D object detection and instance segmentation) to high-level semantic reasoning (e.g., spatial visual question answering and vision-and-language navigation). The resulting models exhibit strong zero-shot performance and achieve further gains after fine-tuning.
📝 Abstract
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, facilitating end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
Problem

Research questions and friction points this paper is trying to address.

3D scene understanding
unlabeled data
internet-scale data
data scarcity
annotation cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

unlabeled internet video
automated data generation
3D scene understanding
zero-shot learning
vision-language navigation
Yixin Chen
Tsinghua University
Bioinformatics, Machine Learning
Yaowei Zhang
State Key Laboratory of General Artificial Intelligence, BIGAI
Huangyue Yu
Beijing Institute for General Artificial Intelligence
Computer Vision and Artificial Intelligence
Junchao He
State Key Laboratory of General Artificial Intelligence, BIGAI; Beijing University of Posts and Telecommunications
Yan Wang
Beijing Institute for General Artificial Intelligence
Scene Understanding
Jiangyong Huang
Peking University
Computer Vision, Artificial Intelligence
Hongyu Shen
State Key Laboratory of General Artificial Intelligence, BIGAI; Beijing Institute of Technology
Junfeng Ni
Tsinghua University
Computer Vision, 3D Reconstruction
Shaofei Wang
State Key Laboratory of General Artificial Intelligence, BIGAI
Baoxiong Jia
Ph.D. in Computer Science, UCLA
Computer Vision, Artificial Intelligence
Song-Chun Zhu
State Key Laboratory of General Artificial Intelligence, BIGAI; Peking University; Tsinghua University
Siyuan Huang
Beijing Institute for General Artificial Intelligence (BIGAI)
Embodied AI, 3D Vision, Robotics, 3D Scene Understanding