🤖 AI Summary
This work addresses zero-shot recognition in open-vocabulary 3D semantic segmentation with a novel cross-modal alignment strategy that bridges the modality gap between LiDAR point clouds and textual descriptions. The method generates class-conditional prototype images from text prompts and distills knowledge from 2D vision foundation models into a 3D network, then aligns point cloud features with visual features extracted from the generated images to segment unseen categories. By introducing text-to-image generation into 3D open-vocabulary tasks for the first time, the framework replaces conventional direct cross-modal alignment. The proposed method achieves state-of-the-art results on both the nuScenes and SemanticKITTI benchmarks.
📝 Abstract
This paper presents a new method for zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive LiDAR data. To circumvent the well-known image-text modality gap intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method instead relies on text-to-image generation to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we label a point cloud by matching 3D point features against 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.
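The matching step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, tensor shapes, and the choice to average the per-class prototype-image features into a single embedding before cosine-similarity matching are all assumptions for the sake of the example.

```python
import numpy as np

def assign_labels(point_feats, proto_feats):
    """Label 3D points by cosine similarity to class prototype embeddings.

    point_feats: (N, D) point features from the distilled 3D network.
    proto_feats: (C, K, D) 2D features from K generated prototype images
                 per class (shapes are illustrative, not from the paper).
    Returns: (N,) predicted class index per point.
    """
    # Average the K prototype-image features into one embedding per class.
    protos = proto_feats.mean(axis=1)                               # (C, D)
    # L2-normalize so dot products become cosine similarities.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    c = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sim = p @ c.T                                                   # (N, C)
    # Each point gets the class whose prototypes it resembles most.
    return sim.argmax(axis=1)
```

This avoids comparing point features to text embeddings directly, which is where the image-text modality gap would otherwise appear: both sides of the similarity are visual features from the same 2D feature space.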