Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection

📅 2024-06-02
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 6
Influential: 0
🤖 AI Summary
Open-vocabulary 3D object detection (OV-3DDet) faces two critical challenges: sparse base-class supervision and poor generalization to arbitrary novel categories. To address these, we propose CoDAv2, a unified framework integrating collaborative discovery and cross-modal alignment. Its key contributions are: (1) 3D Novel Object Discovery with Enrichment (3D-NODE), a pseudo-label discovery and distribution-enrichment strategy that significantly improves the quality and coverage of novel-class training samples; and (2) Discovery-driven Cross-modal Alignment (DCMA) and its extension Box-DCMA, which iteratively align point cloud, image, and text features while leveraging 2D open-vocabulary semantic priors and bounding-box guidance for 3D classification. Evaluated on SUN-RGBD and ScanNetv2, CoDAv2 achieves novel-class AP of 9.17 and 9.12, respectively, surpassing the previous state of the art by +5.56 and +5.38. This represents a substantial advance in open-vocabulary 3D detection performance.

📝 Abstract
Open-vocabulary 3D Object Detection (OV-3DDet) addresses the detection of objects from an arbitrary list of novel categories in 3D scenes, which remains a very challenging problem. In this work, we propose CoDAv2, a unified framework designed to innovatively tackle both the localization and classification of novel 3D objects under the condition of limited base categories. For localization, the proposed 3D Novel Object Discovery (3D-NOD) strategy utilizes 3D geometries and 2D open-vocabulary semantic priors to discover pseudo labels for novel objects during training. 3D-NOD is further extended with an Enrichment strategy that significantly enriches the novel object distribution in the training scenes, and thus enhances the model's ability to localize more novel objects. The 3D-NOD with Enrichment is termed 3D-NODE. For classification, the Discovery-driven Cross-modal Alignment (DCMA) module aligns features from 3D point clouds and 2D/textual modalities, employing both class-agnostic and class-specific alignments that are iteratively refined to handle the expanding vocabulary of objects. In addition, 2D box guidance boosts classification accuracy against complex background noise; this variant is coined Box-DCMA. Extensive evaluation demonstrates the superiority of CoDAv2. CoDAv2 outperforms the best-performing method by a large margin ($\mathrm{AP}_{Novel}$ of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2). Source code and pre-trained models are available at the GitHub project page.
Problem

Research questions and friction points this paper is trying to address.

Detecting novel 3D objects with limited base categories
Aligning 3D point clouds with 2D/textual modalities
Improving localization and classification of open-vocabulary objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-NODE enriches novel object distribution.
DCMA aligns 3D and 2D/textual features.
Box-DCMA boosts classification with 2D boxes.
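The cross-modal alignment idea above can be sketched as CLIP-style open-vocabulary classification: 3D proposal features are compared against per-category text embeddings via cosine similarity, so the category list can be swapped at inference time without retraining. This is a minimal illustrative sketch, not the authors' implementation; the function names, feature dimensions, and temperature value are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize feature vectors to unit length for cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_logits(box_feats, text_embeds, temperature=0.07):
    """Cosine-similarity logits between 3D box features (N, D) and
    per-category text embeddings (C, D), CLIP-style. Returns (N, C)."""
    p = l2_normalize(box_feats)
    t = l2_normalize(text_embeds)
    return (p @ t.T) / temperature

# Toy example: 2 predicted boxes, 3 open-vocabulary categories, 4-dim features.
rng = np.random.default_rng(0)
box_feats = rng.normal(size=(2, 4))     # stand-in for 3D proposal features
text_embeds = rng.normal(size=(3, 4))   # stand-in for category text embeddings
logits = cross_modal_logits(box_feats, text_embeds)
pred = logits.argmax(axis=1)            # predicted category index per box
```

Because classification reduces to a similarity lookup against text embeddings, adding a novel category only requires embedding its name, which is what makes the vocabulary expandable during the iterative refinement described above.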
Yang Cao
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Yihan Zeng
Huawei Noah’s Ark Lab
Hang Xu
Huawei Noah’s Ark Lab
Dan Xu
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology