🤖 AI Summary
This work proposes a training-free, open-vocabulary approach for 3D scene understanding at the voxel level, enabling semantic grouping and textual description without relying on pretrained text encoders such as CLIP or BERT. Leveraging a sparse voxel grid representation, the method directly employs multimodal large language models (MLLMs) to perform voxel clustering and generate semantic labels, thereby constructing a semantic scene map that supports both open-vocabulary segmentation and referring expression segmentation. By integrating sparse voxel representations with a text-to-text retrieval mechanism, the approach significantly outperforms existing methods on complex referring expression segmentation tasks, demonstrating strong zero-shot 3D semantic understanding capabilities.
📝 Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that correspond to the different objects in the scene. By leveraging powerful Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then builds an informative scene map by captioning each group, enabling downstream 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, ours is training-free and does not rely on embeddings from a CLIP/BERT text encoder; instead, we perform text-to-text search directly with MLLMs. Extensive experiments show that our method outperforms recent approaches, particularly on complex RES tasks. The code will be released.
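To make the scene-map idea concrete, here is a minimal, hypothetical sketch (not the paper's actual code): voxel groups are paired with MLLM-generated captions, and a referring expression is matched against those captions by text-to-text comparison. A simple token-overlap score stands in for the MLLM's text-to-text judgment; the captions and group IDs are invented for illustration.

```python
def tokenize(text: str) -> set:
    """Lowercase and split a caption or query into a set of word tokens."""
    return set(text.lower().replace(",", " ").split())

def retrieve_group(scene_map: dict, query: str) -> int:
    """Return the voxel-group id whose caption best matches the query.

    Jaccard overlap between token sets is used here as a stand-in for
    the MLLM-based text-to-text matching described in the abstract.
    """
    q = tokenize(query)
    best_id, best_score = None, -1.0
    for group_id, caption in scene_map.items():
        c = tokenize(caption)
        score = len(q & c) / max(len(q | c), 1)
        if score > best_score:
            best_id, best_score = group_id, score
    return best_id

# Toy scene map: voxel-group id -> caption (both hypothetical).
scene_map = {
    0: "a red armchair near the window",
    1: "a wooden dining table with four chairs",
    2: "a tall bookshelf against the wall",
}

print(retrieve_group(scene_map, "the chair close to the window"))  # → 0
```

In the actual pipeline, a referring expression like "the chair close to the window" would be resolved by the MLLM against the captioned groups; the retrieved group id then selects the corresponding sparse voxels, yielding the 3D segmentation mask.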