🤖 AI Summary
Open-vocabulary 3D object detection aims to localize objects from unseen categories in point clouds, yet existing methods suffer from poor cross-modal alignment robustness due to semantic inconsistency between 3D point cloud features and 2D image features. To address this, we propose a semantic consistency alignment mechanism: (1) high-quality 3D pseudo-labels are mined via self-supervised learning; (2) a dynamic alignment quality assessment module jointly filters noisy matches from multiple sources; and (3) the 3D detector is deeply integrated with a vision-language model to enable open-vocabulary classification and precise 3D localization. Evaluated on nuScenes, our method achieves state-of-the-art performance—improving recall for novel categories by +12.3% and 3D localization accuracy (AP) by +8.7%—demonstrating significant gains in both generalization and geometric precision.
📝 Abstract
Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance.