NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation

📅 2025-04-20

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

Current vision-language models (VLMs) excel at image-level understanding but struggle with precise 3D localization and open-vocabulary recognition in zero-shot 3D instance segmentation. To address this, we propose a training-free hard visual prompting method: leveraging 3D Gaussian splatting to render multi-view images of target objects, followed by camera pose interpolation to generate geometrically consistent view sequences—requiring no 2D–3D alignment, model fine-tuning, or optimization. This is the first work to seamlessly extend hard visual prompting to zero-shot, instance-level, open-vocabulary 3D segmentation by aligning VLM features across rendered views. Evaluated on ScanNet and S3DIS, our method achieves a +12.3% mAP@50 improvement over prior approaches. It enables segmentation of unseen categories using arbitrary natural language descriptions, significantly enhancing robustness and generalization in 3D perception.

Technology Category

Application Category

📝 Abstract

Vision-language models (VLMs) have demonstrated impressive zero-shot transfer capabilities in image-level visual perception tasks. However, they fall short in 3D instance-level segmentation tasks that require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training-free strategy seamlessly integrates with prior hard visual prompts, enriching object-descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.

Problem

Research questions and friction points this paper is trying to address.

Bridging VLMs' gap in 3D instance-level segmentation tasks

Generating diverse viewpoints without 2D-3D optimization or fine-tuning

Enhancing 3D instance segmentation accuracy in diverse scenes

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting hard visual prompting

Camera interpolation for diverse viewpoints

Training-free geometric consistency augmentation

🔎 Similar Papers

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant