🤖 AI Summary
Addressing the critical bottleneck of scarce large-scale real 3D data that hinders the development of spatial intelligence, this paper introduces the first end-to-end, scalable 2D→3D data augmentation pipeline, jointly performing monocular depth estimation, single-view camera calibration, and absolute scale recovery. The pipeline synthesizes large-scale, metrically consistent, photorealistic 3D data, complete with precise pose annotations and multimodal representations (point clouds, depth maps, pseudo-RGB-D), directly from existing 2D image benchmarks (COCO, Objects365). The authors release two new 3D datasets, COCO-3D and Objects365-v2-3D. Experiments demonstrate substantial performance gains across downstream tasks, including monocular 3D object detection, 3D reconstruction, and spatial reasoning in multimodal large language models, without requiring additional 3D supervision. The approach drastically reduces the cost of constructing high-fidelity 3D data, establishing a scalable data infrastructure for general-purpose spatial intelligence.
📝 Abstract
Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike 2D imagery, which is abundant, 3D data typically requires specialized sensors and laborious annotation to acquire. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations, including point clouds, camera poses, depth maps, and pseudo-RGB-D, via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of 2D imagery and the growing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that the generated data benefit a wide range of 3D tasks, from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
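The abstract does not spell out the lifting step, but the core geometric operation such a pipeline relies on is standard: combine an estimated depth map, estimated pinhole intrinsics, and a recovered global scale to back-project pixels into a 3D point cloud. The NumPy sketch below illustrates only that step under simplifying assumptions; the function name, parameter values, and the single global scale factor are illustrative, not the authors' actual implementation (which estimates depth, intrinsics, and scale with learned models).

```python
import numpy as np

def backproject_to_pointcloud(depth, fx, fy, cx, cy, scale=1.0):
    """Lift an (H, W) depth map into a point cloud via a pinhole camera model.

    depth          : per-pixel depth from a monocular estimator (relative units)
    fx, fy, cx, cy : estimated focal lengths and principal point, in pixels
    scale          : global factor mapping relative depth to metric units
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grid
    z = depth * scale                                # absolute-scale depth
    x = (u - cx) / fx * z                            # pinhole back-projection
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy usage with a constant synthetic depth map and assumed intrinsics.
depth = np.full((4, 4), 2.0)
points = backproject_to_pointcloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(points.shape)  # (16, 3): one 3D point per pixel
```

In this simplified view, depth estimation supplies `depth`, single-view camera calibration supplies `fx, fy, cx, cy`, and scale recovery supplies `scale`; the resulting points, together with the source RGB image, yield the pseudo-RGB-D and point-cloud modalities the paper releases.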