3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

📅 2025-07-31

🤖 AI Summary

Existing monocular 3D detection methods operate under closed-set assumptions, which limits their adaptability to the novel environments and unseen object categories found in real-world scenarios. To address this, the authors propose 3D-MOOD, the first end-to-end monocular 3D open-set detector. It conditions object queries on a geometry prior and models images in a canonical image space to lift 2D open-set detections into 3D space. Built upon a Transformer architecture, the method jointly optimizes 2D object detection, depth estimation, and 3D bounding box regression within a unified framework, improving cross-scene generalization and cross-dataset transferability. Evaluated on both closed-set (Omni3D) and open-set benchmarks (Omni3D→Argoverse 2, Omni3D→ScanNet), 3D-MOOD achieves state-of-the-art performance, improving detection accuracy for unseen categories and robustness on out-of-distribution data.

📝 Abstract

Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift open-set 2D detections into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries on a geometry prior to improve the generalization of 3D estimation across diverse scenes. To further improve performance, we design a canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.
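
To make the idea of lifting a 2D detection into 3D concrete, the sketch below back-projects a pixel location and a depth value into camera coordinates with the standard pinhole model. It is only an illustration of the geometry, not the paper's 3D bounding box head: the function name `unproject_center`, the choice of inputs (a projected 3D center plus a metric depth), and the example intrinsics are all assumptions that the abstract does not specify.

```python
from typing import Tuple


def unproject_center(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) at a given depth into camera coordinates.

    Hypothetical helper: assumes a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy), all in pixels, and a depth
    measured along the camera's z-axis.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx  # horizontal offset scaled by depth
    y = (v - cy) * depth / fy  # vertical offset scaled by depth
    return (x, y, depth)


# Illustrative values only (not taken from the paper).
center = unproject_center(
    u=740.0, v=360.0, depth=10.0,
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
print(center)
```

In this sketch a detector would still need to predict the depth, the box dimensions, and the orientation; only the final geometric step from image space to 3D is shown.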

Problem

Research questions and friction points this paper is trying to address.

Monocular 3D open-set object detection challenge

Lifting 2D detection to 3D for novel scenes

Generalizing 3D estimation across diverse environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lifts 2D to 3D via bounding box head

Uses geometry prior for object queries

Introduces canonical image space training
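
One plausible reading of a canonical image space is that images from different datasets are brought to a common resolution while the camera intrinsics are updated to match. The sketch below follows that reading under explicit assumptions: the aspect ratio is preserved, padding is added only on the right and bottom, and the target size is arbitrary. The function name `to_canonical_space` and every numeric value are hypothetical; the summary above does not give these details.

```python
from typing import Dict, Tuple


def to_canonical_space(
    width: int, height: int,
    fx: float, fy: float, cx: float, cy: float,
    target_w: int, target_h: int,
) -> Dict[str, Tuple]:
    """Compute resize, padding, and rescaled intrinsics for a fixed canvas.

    Assumption: the image is resized with its aspect ratio preserved and
    then padded on the right and bottom, so the principal point only
    needs to be scaled, not shifted.
    """
    scale = min(target_w / width, target_h / height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    return {
        "resized": (new_w, new_h),
        "pad_right_bottom": (target_w - new_w, target_h - new_h),
        # Focal lengths and principal point scale with the image.
        "intrinsics": (fx * scale, fy * scale, cx * scale, cy * scale),
    }


# Illustrative values only: a 1920x1080 image mapped to a 960x640 canvas.
result = to_canonical_space(
    width=1920, height=1080,
    fx=1000.0, fy=1000.0, cx=960.0, cy=540.0,
    target_w=960, target_h=640,
)
print(result)
```

Keeping the intrinsics consistent with the resized image matters because any 3D quantity recovered from pixels, such as a back-projected box center, depends on them.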
