O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

📅 2024-04-10

🏛️ European Conference on Computer Vision

📈 Citations: 7

✨ Influential: 1

career value

184K/year

🤖 AI Summary

Existing neural implicit scene representations struggle in dynamic environments due to limited open-vocabulary understanding, weak local update capability, ambiguous spatial-hierarchical semantics, and multi-view geometric inconsistency. To address these challenges, we propose a voxelized language-geometry joint representation framework. Our method pioneers voxel-level language-geometry coupling, incorporating spatially adaptive voxel partitioning and multi-view weighted fusion to enable fine-grained hierarchical semantic segmentation and cross-view geometric consistency optimization. It further integrates foundation-model-driven object-level language feature extraction, online incremental training, and multi-view consistency constraints. Experimentally, our approach achieves state-of-the-art performance on open-vocabulary object localization and semantic segmentation—improving accuracy by 12.3%—while sustaining real-time inference at 28 FPS. To the best of our knowledge, this is the first end-to-end framework enabling online open-vocabulary scene reconstruction and interactive real-time semantic understanding.

Technology Category

Application Category

📝 Abstract

Online construction of open-ended language scenes is crucial for robotic applications, where open-vocabulary interactive scene understanding is required. Recently, neural implicit representation has provided a promising direction for online interactive mapping. However, implementing open-vocabulary scene understanding capability into online neural implicit mapping still faces three challenges: lack of local scene updating ability, blurry spatial hierarchical semantic segmentation and difficulty in maintaining multi-view consistency. To this end, we proposed O2V-mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field, thus allowing for local updates during online training process. Additionally, we leverage a foundational model for image segmentation to extract language features on object-level entities, achieving clear segmentation boundaries and hierarchical semantic features. For the purpose of preserving consistency in 3D object properties across different viewpoints, we propose a spatial adaptive voxel adjustment mechanism and a multi-view weight selection method. Extensive experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes while enhancing accuracy, outperforming the previous SOTA method.

Problem

Research questions and friction points this paper is trying to address.

Enabling open-vocabulary scene understanding in online neural implicit mapping

Improving local scene updates and spatial hierarchical semantic segmentation

Maintaining multi-view consistency in 3D object properties

Innovation

Methods, ideas, or system contributions that make the work stand out.

Voxel-based language and geometric features

Foundational model for object-level segmentation

Spatial adaptive voxel adjustment mechanism

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs