LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of existing open-vocabulary 3D scene understanding methods built on 3D Gaussian Splatting: spatial ambiguity and semantic hierarchy ambiguity, which lead to inaccurate feature registration and loss of fine-grained semantics. To overcome them, the paper adopts Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometric representation and regularizes it with monocular depth and normal priors. The resulting stable geometry supports confidence-aware, deterministic vision-language feature registration and suppresses the semantic bleeding caused by overlapping Gaussians. The paper further exploits the dense alignment capability of the AM-RADIO foundation model to mitigate semantic hierarchy ambiguity. Without requiring complex hierarchical training, the proposed method achieves state-of-the-art performance on open-vocabulary 3D object retrieval and point cloud understanding tasks, outperforming existing approaches, particularly on fine-grained queries.
📝 Abstract
Recent advances in open-vocabulary 3D scene understanding rely heavily on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of the foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
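To make the idea of deterministic, confidence-aware registration more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the registration rule, so the confidence-weighted average, the function names (fuse_features, cosine, retrieve), and the toy 2-D features are all hypothetical. In a real pipeline the voxel IDs would presumably come from SVRaster ray hits and the features from a dense vision-language encoder such as AM-RADIO.

```python
import math
from collections import defaultdict


def fuse_features(observations, dim):
    """Fuse per-pixel features into voxels with a confidence-weighted mean.

    observations: iterable of (voxel_id, feature, confidence) triples.
    Assumption: each observation maps to exactly one voxel, which is what
    makes the assignment deterministic rather than probabilistic.
    """
    sums = defaultdict(lambda: [0.0] * dim)
    weights = defaultdict(float)
    for voxel_id, feature, confidence in observations:
        if confidence <= 0.0:
            continue  # ignore observations with no usable confidence
        acc = sums[voxel_id]
        for i, value in enumerate(feature):
            acc[i] += confidence * value
        weights[voxel_id] += confidence
    return {v: [x / weights[v] for x in acc] for v, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(voxel_features, query, top_k=1):
    """Return the IDs of the top_k voxels most similar to a query embedding."""
    ranked = sorted(
        voxel_features,
        key=lambda v: cosine(voxel_features[v], query),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: voxel 0 is seen twice with different confidences, voxel 1 once.
observations = [
    (0, [1.0, 0.0], 0.9),
    (0, [0.0, 1.0], 0.1),
    (1, [0.0, 1.0], 0.8),
]
fused = fuse_features(observations, dim=2)
best = retrieve(fused, query=[1.0, 0.0])
```

Because every observation lands in a single disjoint voxel, a low-confidence view only dilutes that voxel's own feature and cannot leak into its neighbours. That is the intuition behind suppressing semantic bleeding, although the paper's actual weighting scheme may differ.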
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary 3D scene understanding
spatial ambiguity
semantic ambiguity
3D Gaussian Splatting
feature registration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Voxel Rasterization
Open-Vocabulary 3D Understanding
Deterministic Feature Registration
Semantic Bleeding Suppression
AM-RADIO Alignment