🤖 AI Summary
Current vision-language models (VLMs) excel at image-level understanding but struggle with the precise 3D localization and open-vocabulary recognition that zero-shot 3D instance segmentation requires. To address this, we propose a training-free hard visual prompting method: it interpolates camera poses around each target object and renders the resulting viewpoints with 3D Gaussian Splatting, producing geometrically consistent view sequences without any 2D-3D optimization or model fine-tuning. This is the first work to extend hard visual prompting to zero-shot, instance-level, open-vocabulary 3D segmentation by aligning VLM features across rendered views. Evaluated on ScanNet and S3DIS, our method achieves a +12.3% mAP@50 improvement over prior approaches. It segments unseen categories from arbitrary natural language descriptions, improving robustness and generalization in 3D perception.
📝 Abstract
Vision-language models (VLMs) have demonstrated impressive zero-shot transfer capabilities in image-level visual perception tasks. However, they fall short in 3D instance-level segmentation tasks, which require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting-based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training-free strategy integrates seamlessly with prior hard visual prompts, enriching object-descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.
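The abstract does not spell out how the camera interpolation is carried out. As a rough illustration only, the sketch below shows one common way to interpolate between two camera poses: linear interpolation of camera positions combined with spherical linear interpolation (slerp) of orientations stored as unit quaternions. The function names `slerp` and `interpolate_poses`, the `(w, x, y, z)` quaternion layout, and the example poses are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two quaternions (w, x, y, z)."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1, dot = -q1, -dot
    if dot > 0.9995:
        # Nearly identical orientations: normalized lerp avoids dividing by ~0.
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def interpolate_poses(pos0, quat0, pos1, quat1, num_views):
    """Return num_views (position, quaternion) pairs from pose 0 to pose 1, inclusive."""
    poses = []
    for t in np.linspace(0.0, 1.0, num_views):
        pos = (1.0 - t) * pos0 + t * pos1  # straight-line path between camera centers
        poses.append((pos, slerp(quat0, quat1, t)))
    return poses

# Hypothetical example: two cameras whose orientations differ by 90 degrees about z.
pos_a = np.array([1.0, 0.0, 0.0])
quat_a = np.array([1.0, 0.0, 0.0, 0.0])
pos_b = np.array([0.0, 1.0, 0.0])
quat_b = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])

views = interpolate_poses(pos_a, quat_a, pos_b, quat_b, num_views=5)
```

Each interpolated pose would then be passed to a 3D Gaussian Splatting renderer to produce one image of the target object. Note that interpolating positions along a straight line cuts the chord between the two cameras; a method that needs a constant distance to the object might interpolate on an orbit around it instead.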