OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary 3D instance segmentation methods face two key bottlenecks: (1) proposal generation relies on dataset-specific networks or mesh-based structures, limiting generalization to mesh-free, unstructured scenes; and (2) CLIP-style classifiers exhibit weak semantic reasoning, failing to handle compositional and functional text queries. To address these, we propose an online cross-view visual–spatial tracking mechanism that dynamically constructs 3D instances without predefined proposals, and integrate a multimodal large language model to enhance joint text–geometry reasoning for fine-grained semantic understanding. Our method initializes point-cloud instances from 2D open-vocabulary segmentation masks and depth maps, achieves frame-consistent tracking via fusion of DINO visual features and spatial trajectories, and supports superpoint optimization. Evaluated on ScanNet200, Replica, and other benchmarks, it achieves state-of-the-art performance, significantly improving generalization to mesh-free environments and accuracy on complex linguistic queries.

📝 Abstract
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoint refinement module to further enhance performance when a scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
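The mask-lifting step described in the abstract (2D open-vocabulary masks plus depth producing per-instance point clouds) can be sketched with a standard pinhole back-projection. This is an illustrative reconstruction, not the paper's code; the function name and parameters are hypothetical.

```python
import numpy as np

def lift_mask_to_pointcloud(mask, depth, K, cam_to_world=None):
    """Unproject a 2D instance mask into a 3D point cloud using depth.

    mask:         (H, W) boolean instance mask from a 2D segmenter
    depth:        (H, W) depth map in meters
    K:            (3, 3) pinhole camera intrinsics
    cam_to_world: optional (4, 4) pose to express points in world coordinates
    """
    v, u = np.nonzero(mask & (depth > 0))  # pixel rows/cols with valid depth
    z = depth[v, u]
    # Back-project pixels to camera coordinates via the pinhole model
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)      # (N, 3) camera-frame points
    if cam_to_world is not None:
        pts = pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts

# Toy example: 4x4 frame, constant 2 m depth, a 2x2 instance mask
mask = np.zeros((4, 4), dtype=bool); mask[1:3, 1:3] = True
depth = np.full((4, 4), 2.0)
K = np.array([[2.0, 0, 2.0], [0, 2.0, 2.0], [0, 0, 1.0]])
pts = lift_mask_to_pointcloud(mask, depth, K)
print(pts.shape)  # (4, 3)
```

With a known camera pose per frame, lifted points from different views land in a shared world frame, which is what makes the subsequent cross-view tracking possible.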
Problem

Research questions and friction points this paper is trying to address.

Generalizing 3D instance segmentation to unstructured mesh-free environments
Overcoming limitations of dataset-specific proposals and mesh-dependent methods
Enhancing textual reasoning for compositional user queries beyond CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online visual-spatial tracker for cross-view consistent proposals
Mesh-free pipeline with optional superpoints refinement module
Multi-modal large language model replacing CLIP for better reasoning
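The "visual-spatial tracker" bullet above amounts to an association problem: each new detection is matched to an existing track by combining appearance similarity (DINO-style features) with spatial proximity of the lifted point clouds. A minimal greedy sketch, under assumed names and a simple weighted-sum cost that may differ from the paper's actual fusion:

```python
import numpy as np

def match_instances(track_feats, track_centroids, det_feats, det_centroids,
                    w_vis=0.5, max_dist=0.5, score_thresh=0.6):
    """Greedy cross-view matching fusing visual and spatial cues.

    track_feats / det_feats:         (T, D) / (N, D) L2-normalized features
    track_centroids / det_centroids: (T, 3) / (N, 3) centroids in world frame
    Returns (track_idx, det_idx) pairs; in an online tracker, unmatched
    detections would start new tracks.
    """
    vis_sim = det_feats @ track_feats.T                    # (N, T) cosine similarity
    dist = np.linalg.norm(det_centroids[:, None] - track_centroids[None], axis=-1)
    spa_sim = np.clip(1.0 - dist / max_dist, 0.0, 1.0)     # (N, T) spatial affinity
    score = w_vis * vis_sim + (1 - w_vis) * spa_sim
    matches, used_t, used_d = [], set(), set()
    # Greedily accept the highest-scoring pairs above the threshold
    for n, t in sorted(np.ndindex(score.shape), key=lambda i: -score[i]):
        if score[n, t] < score_thresh or n in used_d or t in used_t:
            continue
        matches.append((t, n))
        used_d.add(n); used_t.add(t)
    return matches

# Toy example: two tracks and two detections with identical features/positions
track_feats = np.eye(2); det_feats = np.eye(2)
cents = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
matches = match_instances(track_feats, cents, det_feats, cents)
print(matches)  # [(0, 0), (1, 1)]
```

A production tracker would typically replace the greedy loop with optimal assignment (e.g. Hungarian matching) and a richer spatial term such as 3D IoU, but the fused visual + spatial cost is the core idea.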
Zhishan Zhou
PICO, ByteDance, Beijing
Siyuan Wei
PICO, ByteDance, Beijing
Zengran Wang
PICO, ByteDance, Beijing
Chunjie Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Xiaosheng Yan
PICO, ByteDance, Beijing
Xiao Liu
PICO, ByteDance, Beijing