OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a training-free approach to open-vocabulary 3D scene understanding at the voxel level that groups voxels into objects and describes each group in text, without relying on embeddings from a pretrained text encoder such as CLIP or BERT. Starting from a sparse voxel representation of the scene, the method forms object-level voxel groups and uses vision-language models (VLMs) and multimodal large language models (MLLMs) to caption each group, building a scene map that supports both open-vocabulary segmentation and referring expression segmentation. Queries are resolved by text-to-text search over the captions with an MLLM. The authors report superior performance over recent methods, particularly on complex referring expression segmentation.

📝 Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling downstream 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly perform text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex RES tasks. The code will be made publicly available.
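
To make the text-to-text search described in the abstract concrete, here is a minimal Python sketch of querying a captioned scene map. It is an illustration under stated assumptions, not the authors' implementation: the `VoxelGroup` structure, the example captions, and the `retrieve` function are hypothetical, and the word-overlap scorer is only a stand-in for the MLLM that the paper uses to match a query against group captions.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class VoxelGroup:
    """One entry of a hypothetical scene map: a voxel group and its caption."""
    group_id: int
    caption: str          # free-text description of the group (made-up data)
    voxel_ids: List[int]  # indices of the sparse voxels in this group


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, caption: str) -> float:
    """Jaccard word overlap. A stand-in for an MLLM relevance judgment."""
    q, c = tokenize(query), tokenize(caption)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def retrieve(
    query: str,
    scene_map: List[VoxelGroup],
    scorer: Callable[[str, str], float] = overlap_score,
) -> VoxelGroup:
    """Return the group whose caption best matches the query text.

    No text embeddings are involved: the comparison is text against text,
    and `scorer` could be swapped for a call to an MLLM.
    """
    return max(scene_map, key=lambda g: scorer(query, g.caption))


# Made-up scene map for illustration only.
scene_map = [
    VoxelGroup(0, "a red ceramic mug on the wooden table", [0, 1, 2]),
    VoxelGroup(1, "a green potted plant next to the window", [3, 4]),
    VoxelGroup(2, "a black office chair", [5, 6, 7, 8]),
]

match = retrieve("the mug that is red", scene_map)
print(match.group_id, match.voxel_ids)
```

The voxel indices of the returned group serve as the segmentation result for the query. Keeping the scorer as a parameter isolates the one component that a real system would delegate to a language model.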
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary 3D scene understanding
voxel grouping
captioning
referring expression segmentation
training-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
open-vocabulary 3D scene understanding
voxel grouping and captioning
vision-language models
referring expression segmentation