🤖 AI Summary
This work addresses the performance limitations of existing monocular 3D object detection methods on in-the-wild scenes, which stem primarily from the scarcity of high-quality 3D annotations. To overcome this, the authors propose LabelAny3D, an analysis-by-synthesis 3D scene reconstruction framework that leverages foundation models to drive an automated annotation pipeline. The pipeline efficiently generates open-vocabulary, large-scale 3D bounding box annotations from single 2D images and is used to construct COCO3D, a benchmark dataset built on MS-COCO. By removing the reliance on manual labeling and constrained capture environments, the approach sidesteps the traditional bottlenecks of 3D annotation. The resulting pseudo-labels significantly improve performance across multiple monocular 3D detection benchmarks and surpass existing automatic annotation techniques in quality, demonstrating strong scalability and effectiveness for open-world 3D recognition.
📝 Abstract
Detecting objects in 3D space from monocular input is crucial for applications ranging from robotics to scene understanding. Despite strong performance in the indoor and autonomous driving domains, existing monocular 3D detection models struggle with in-the-wild images due to the lack of in-the-wild 3D datasets and the challenges of 3D annotation. We introduce LabelAny3D, an *analysis-by-synthesis* framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations. Built on this pipeline, we present COCO3D, a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset and covering a wide range of object categories absent from existing 3D datasets. Experiments show that annotations generated by LabelAny3D improve monocular 3D detection performance across multiple benchmarks, outperforming prior auto-labeling approaches in annotation quality. These results demonstrate the promise of foundation-model-driven annotation for scaling up 3D recognition in realistic, open-world settings.
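The abstract stays at a high level, so the sketch below only illustrates the general idea of turning foundation-model outputs (a monocular depth map and an instance mask) into a 3D bounding box pseudo-label: back-project the masked pixels into camera space and fit a gravity-aligned box via a yaw search. Everything here, including `backproject_mask`, `fit_yaw_aligned_box`, the intrinsics, and the brute-force fitting strategy, is an illustrative assumption written with plain NumPy; it is not the LabelAny3D analysis-by-synthesis pipeline, which reconstructs holistic 3D scenes rather than fitting boxes to lifted points.

```python
import numpy as np

def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels of a metric depth map into a camera-space point cloud."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]      # drop pixels without valid depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=1)           # (N, 3) points in camera coordinates

def fit_yaw_aligned_box(points, num_angles=90):
    """Fit a gravity-aligned 3D box by searching yaw for the tightest enclosing volume."""
    best = None
    for yaw in np.linspace(0.0, np.pi / 2, num_angles, endpoint=False):
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # rotation about the up axis
        p = points @ R.T                         # express points in the candidate box frame
        lo, hi = p.min(axis=0), p.max(axis=0)
        volume = float(np.prod(hi - lo))
        if best is None or volume < best[0]:
            center = R.T @ ((lo + hi) / 2.0)     # map the box center back to camera coordinates
            best = (volume, center, hi - lo, yaw)
    _, center, size, yaw = best
    return center, size, yaw

if __name__ == "__main__":
    # Toy inputs; in a real pipeline the depth map and instance mask would come from
    # off-the-shelf foundation models (monocular depth, open-vocabulary segmentation).
    depth = np.zeros((480, 640))
    depth[200:300, 250:400] = 5.0                # a synthetic "object" roughly 5 m away
    mask = depth > 0
    pts = backproject_mask(depth, mask, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    center, size, yaw = fit_yaw_aligned_box(pts)
    print("center:", center, "size:", size, "yaw (rad):", yaw)
```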