🤖 AI Summary
Addressing the critical bottleneck of scarce large-scale real 3D data that hinders the development of spatial intelligence, this paper introduces the first end-to-end, scalable 2D→3D data augmentation pipeline, jointly performing monocular depth estimation, single-view camera calibration, and absolute scale recovery. The pipeline synthesizes large-scale, metrically consistent, photorealistic 3D data, complete with precise pose annotations and multimodal representations (point clouds, depth maps, pseudo-RGB-D), directly from existing 2D image benchmarks (COCO, Objects365). The authors release two new 3D datasets, COCO-3D and Objects365-v2-3D. Experiments demonstrate substantial performance gains across downstream tasks, including monocular 3D object detection, 3D reconstruction, and spatial reasoning in multimodal large language models, without requiring additional 3D supervision. The approach drastically reduces the cost of constructing high-fidelity 3D data, establishing a scalable data infrastructure for general-purpose spatial intelligence.
📝 Abstract
Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike 2D imagery, which is abundant, 3D data typically requires specialized sensors and laborious annotation to acquire. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations, including point clouds, camera poses, depth maps, and pseudo-RGB-D, via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of 2D imagery and the growing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that the generated data benefit a wide range of 3D tasks, from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
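The abstract does not spell out the lifting step, but the core geometric operation such a pipeline relies on is standard: combine an estimated depth map, estimated pinhole intrinsics, and a recovered global scale to back-project pixels into a 3D point cloud. The NumPy sketch below illustrates only that step under simplifying assumptions; the function name, parameter values, and the single global scale factor are illustrative, not the authors' actual implementation (which estimates depth, intrinsics, and scale with learned models).

```python
import numpy as np

def backproject_to_pointcloud(depth, fx, fy, cx, cy, scale=1.0):
    """Lift an (H, W) depth map into a point cloud via a pinhole camera model.

    depth          : per-pixel depth from a monocular estimator (relative units)
    fx, fy, cx, cy : estimated focal lengths and principal point, in pixels
    scale          : global factor mapping relative depth to metric units
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grid
    z = depth * scale                                # absolute-scale depth
    x = (u - cx) / fx * z                            # pinhole back-projection
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy usage with a constant synthetic depth map and assumed intrinsics.
depth = np.full((4, 4), 2.0)
points = backproject_to_pointcloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(points.shape)  # (16, 3): one 3D point per pixel
```

In this simplified view, depth estimation supplies `depth`, single-view camera calibration supplies `fx, fy, cx, cy`, and scale recovery supplies `scale`; the resulting points, together with the source RGB image, yield the pseudo-RGB-D and point-cloud modalities the paper releases.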