Detect Anything 3D in the Wild

📅 2025-04-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing monocular 3D object detection methods operate under closed-set assumptions and generalize poorly to unseen object categories and novel camera configurations. To address this limitation, we propose DetAny3D, the first framework enabling open-world zero-shot monocular 3D detection. Our approach transfers knowledge from 2D foundation models (SAM and CLIP) to 3D detection via a novel 2D Aggregator and a 3D Interpreter with Zero-Embedding Mapping, effectively mitigating catastrophic forgetting during cross-dimensional adaptation. We further integrate feature alignment, disentangled 3D geometric modeling, and monocular depth priors for robust scene understanding. Evaluated on both unseen categories and new camera setups, DetAny3D achieves state-of-the-art performance while also surpassing most prior methods on standard benchmarks, significantly improving generalization to rare or novel objects in real-world applications such as autonomous driving.

πŸ“ Abstract
Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which mitigates catastrophic forgetting in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of a 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings. More visualization results can be found at the DetAny3D project page.
Problem

Research questions and friction points this paper is trying to address.

Detect novel 3D objects in wild settings
Generalize to unseen camera configurations
Overcome limited annotated 3D data availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Promptable 3D detection model for novel objects
Leverages pre-trained 2D models for 3D knowledge
2D Aggregator and 3D Interpreter modules enhance transfer
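The summary above names the two transfer modules but not their internals. As a rough illustration of the described data flow, the sketch below fuses features from two pre-trained 2D backbones and maps them to 3D box parameters through a zero-initialized head. All shapes, layer choices, and the additive fusion rule are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical sketch of the DetAny3D data flow; the real model, layer
# sizes, and fusion rule are NOT specified in this summary, so every
# shape and operation below is illustrative only.

rng = np.random.default_rng(0)

def aggregate_2d(sam_feat, clip_feat, d_model=64):
    """2D Aggregator (sketch): project features from two 2D foundation
    models into a shared space and fuse them by addition."""
    w_sam = rng.standard_normal((sam_feat.shape[-1], d_model)) * 0.02
    w_clip = rng.standard_normal((clip_feat.shape[-1], d_model)) * 0.02
    return sam_feat @ w_sam + clip_feat @ w_clip

def interpret_3d(fused, num_params=7):
    """3D Interpreter with zero-embedding mapping (sketch): the new 3D
    head starts from zero-initialized weights, so at initialization it
    contributes nothing on top of the pre-trained 2D pathway -- one way
    to read the 'mitigates catastrophic forgetting' claim."""
    w_3d = np.zeros((fused.shape[-1], num_params))  # zero-init mapping
    return fused @ w_3d  # e.g. (x, y, z, w, h, l, yaw) per token

tokens = 16
sam_feat = rng.standard_normal((tokens, 256))   # assumed SAM feature dim
clip_feat = rng.standard_normal((tokens, 512))  # assumed CLIP feature dim

fused = aggregate_2d(sam_feat, clip_feat)
boxes = interpret_3d(fused)
print(fused.shape, boxes.shape)  # (16, 64) (16, 7)
```

The zero-initialized head means early gradient updates perturb the pre-trained 2D features only gradually, which is the intuition behind the module's name.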
Hanxue Zhang
OpenDriveLab at Shanghai AI Laboratory, Shanghai Jiao Tong University

Haoran Jiang
OpenDriveLab at Shanghai AI Laboratory, Fudan University

Qingsong Yao
Stanford University | ICT, CAS
Medical Image Computing, Medical Image Analysis

Yanan Sun
OpenDriveLab at Shanghai AI Laboratory

Renrui Zhang
Seed ByteDance & MMLab & PKU
Large Multimodal Model, Generative Model, Embodied AI

Hao Zhao
Tsinghua University

Hongyang Li
OpenDriveLab at Shanghai AI Laboratory

Hongzi Zhu
Shanghai Jiao Tong University
Mobile Computing, Vehicular Networks, Internet of Things

Zetong Yang
OpenDriveLab at Shanghai AI Laboratory, GAC R&D Center