🤖 AI Summary
This work tackles the noise and sparsity of point labels obtained when predictions from image-based vision-language models (VLMs) are back-projected onto LiDAR point clouds in driving scenes. The proposed framework, LOSC, consolidates these sparse 2D-derived labels by enforcing spatio-temporal consistency and robustness to image-level augmentations, yielding higher-quality pseudo-labels that supervise the training of a 3D segmentation network. On nuScenes and SemanticKITTI, LOSC outperforms existing zero-shot open-vocabulary semantic and panoptic segmentation methods by significant margins, setting a new state of the art. These results support combining cross-modal label transfer with label refinement for robust 3D scene understanding.
📝 Abstract
We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds, but the resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations, and then train a 3D network on the refined labels. This simple method, called LOSC, outperforms the state of the art in zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, by significant margins.
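The back-projection step mentioned above can be illustrated with a minimal sketch: each lidar point is transformed into the camera frame, projected through a pinhole model, and assigned the 2D semantic label it lands on. This is a generic, hypothetical implementation, not the paper's code; the function name, the `sem_map` input (an integer label map, e.g. from a VLM), and the transform conventions are assumptions.

```python
import numpy as np

def backproject_labels(points_lidar, sem_map, T_cam_from_lidar, K):
    """Assign each lidar point the 2D semantic label it projects onto.

    points_lidar     : (N, 3) xyz coordinates in the lidar frame
    sem_map          : (H, W) integer label map (hypothetical VLM output)
    T_cam_from_lidar : (4, 4) rigid transform from lidar to camera frame
    K                : (3, 3) camera intrinsics
    Returns (N,) labels; -1 marks points outside the image or behind the camera.
    """
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])   # homogeneous coordinates
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]      # lidar -> camera frame
    z = pts_cam[:, 2]
    uv = (K @ pts_cam.T).T                               # pinhole projection
    uv = uv[:, :2] / np.maximum(uv[:, 2:3], 1e-9)        # perspective divide
    h, w = sem_map.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.full(n, -1, dtype=np.int64)              # -1 = no label
    labels[valid] = sem_map[v[valid], u[valid]]
    return labels
```

Labels produced this way are exactly the noisy, sparse supervision the paper starts from: many points fall outside any camera frustum (the `-1` entries), and occlusions or calibration errors corrupt the rest, which motivates the subsequent consolidation step.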