AI Summary
3D open-vocabulary detection (OVD) on LiDAR point clouds faces challenges due to the absence of predefined category labels and difficulty in discriminating visually similar objects. Method: We propose a global-local collaborative reasoning and multi-round debate framework. Specifically, we integrate the commonsense reasoning of large language models (LLMs) with probabilistic soft logic (OV-PSL) for interpretable semantic inference; design static and dynamic class balancing (SBC/DBC) to mitigate long-tail distribution; and introduce reflective pseudo-label generation (RPLG) and background-aware object localization (BAOL) to enhance localization robustness. Results: On ScanNet and SUN RGB-D under full open-vocabulary settings, our method achieves absolute mAP improvements of 4.03% and 14.11%, respectively, significantly outperforming prior approaches. To the best of our knowledge, this is the first end-to-end 3D OVD framework enabling joint scene-level and object-level representation learning and inference.
Abstract
LiDAR-based 3D Open-Vocabulary Detection (3D OVD) requires a detector to learn to detect novel objects from point clouds without off-the-shelf training labels. Previous methods focus on learning object-level representations and ignore scene-level information, which makes it hard to distinguish objects of similar classes. In this work, we propose a Global-Local Collaborative Reason and Debate with PSL (GLRD) framework for the 3D OVD task that considers both local object-level information and global scene-level information. Specifically, an LLM performs commonsense reasoning over object-level and scene-level information, and the detection results are refined accordingly. To further improve the precision of the LLM's decisions, we design a probabilistic soft logic solver (OV-PSL) to search for the optimal solution and a debate scheme to confirm the class of confusable objects. To alleviate the uneven distribution of classes, we design a static balance scheme (SBC) and a dynamic balance scheme (DBC). To reduce the influence of noise in data and training, we further propose Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL). Extensive experiments on ScanNet and SUN RGB-D demonstrate the superiority of GLRD: in the partial open-vocabulary setting, the absolute improvements in mean average precision are $+2.82\%$ on SUN RGB-D and $+3.72\%$ on ScanNet; in the full open-vocabulary setting, they are $+4.03\%$ on ScanNet and $+14.11\%$ on SUN RGB-D.
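The abstract does not spell out the rules or solver used in OV-PSL, so the following is only a minimal sketch of generic probabilistic soft logic, not the authors' method. It assumes the standard Lukasiewicz relaxation, in which atoms take soft truth values in [0, 1] and each weighted rule contributes a hinge-style "distance to satisfaction". The two rules, the predicate names, the weights, and the grid search are all hypothetical choices made for illustration.

```python
# Generic PSL sketch (hypothetical rules and weights, not taken from the paper).

def l_and(a: float, b: float) -> float:
    """Lukasiewicz conjunction of two soft truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def l_not(a: float) -> float:
    """Lukasiewicz negation."""
    return 1.0 - a


def distance_to_satisfaction(body: float, head: float) -> float:
    """How far the rule `body -> head` is from being satisfied (0 means satisfied)."""
    return max(0.0, body - head)


def total_loss(y: float, scene_bathroom: float,
               looks_like_toilet: float, looks_like_chair: float) -> float:
    """Weighted sum of rule violations for y = truth of "object is a toilet"."""
    # Hypothetical rule 1 (weight 2.0): bathroom(scene) AND looks_like_toilet(obj) -> toilet(obj)
    r1 = distance_to_satisfaction(l_and(scene_bathroom, looks_like_toilet), y)
    # Hypothetical rule 2 (weight 1.0): looks_like_chair(obj) -> NOT toilet(obj)
    r2 = distance_to_satisfaction(looks_like_chair, l_not(y))
    return 2.0 * r1 + 1.0 * r2


def infer_truth(scene_bathroom: float, looks_like_toilet: float,
                looks_like_chair: float) -> float:
    """Pick the soft truth value on a 0.01 grid that minimizes the total loss."""
    grid = [i / 100 for i in range(101)]
    return min(grid, key=lambda y: total_loss(
        y, scene_bathroom, looks_like_toilet, looks_like_chair))


if __name__ == "__main__":
    # Scene-level evidence (bathroom) outweighs the object-level chair cue.
    print(infer_truth(scene_bathroom=0.9, looks_like_toilet=0.8, looks_like_chair=0.6))
```

In this toy setting the scene-level cue pulls the ambiguous object toward the "toilet" label even though the object-level evidence is mixed, which is the kind of global-local interaction the abstract describes. A practical PSL solver would optimize many atoms jointly with convex optimization rather than a one-dimensional grid search.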