FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses open-vocabulary 3D semantic mapping in large-scale environments, where existing dense methods remain limited in scalability. The authors propose an online dual-layer semantic mapping framework that maintains a dense voxel-level layer and an instance-level open-vocabulary layer within a shared voxel map. A cross-layer semantic fusion step, restricted together with the dense layer to a spatial sliding window, improves the quality of both layers while keeping the instance-level map scalable. The method is evaluated on established 3D semantic segmentation benchmarks and on large, multi-story real-world scenes, where it is reported to produce accurate open-vocabulary maps that extend to previously unseen concepts.
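To make the cross-layer idea concrete, here is a minimal sketch of fusing a dense voxel embedding with the embedding of the instance occupying that voxel, limited to a window around the robot. Everything in it is an assumption made for illustration: the function names, the cubic window, and the convex blend with weight `alpha` are hypothetical, and the summary does not state the fusion rule the authors actually use.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def in_window(voxel, center, radius):
    """True if the integer voxel index lies in a cubic window around `center`."""
    return all(abs(a - b) <= radius for a, b in zip(voxel, center))

def fuse_layers(dense, voxel_to_instance, instance_emb, center, radius, alpha=0.5):
    """Blend dense and instance embeddings for voxels inside the window.

    ASSUMPTION: a convex combination of unit vectors with weight `alpha`.
    This is an illustrative stand-in, not the paper's fusion rule.
    """
    fused = {}
    for voxel, dense_vec in dense.items():
        inst = voxel_to_instance.get(voxel)
        # Skip voxels without an instance or outside the sliding window.
        if inst is None or not in_window(voxel, center, radius):
            continue
        d = normalize(dense_vec)
        e = normalize(instance_emb[inst])
        fused[voxel] = normalize([alpha * a + (1 - alpha) * b for a, b in zip(d, e)])
    return fused

# Hypothetical toy map: three voxels, one instance (id 7).
dense = {(0, 0, 0): [1.0, 0.0], (5, 0, 0): [1.0, 0.0], (1, 0, 0): [0.0, 2.0]}
voxel_to_instance = {(0, 0, 0): 7, (5, 0, 0): 7}
instance_emb = {7: [0.0, 1.0]}
fused = fuse_layers(dense, voxel_to_instance, instance_emb, center=(0, 0, 0), radius=2)
```

In this toy setup only voxel `(0, 0, 0)` is fused: `(5, 0, 0)` lies outside the window and `(1, 0, 0)` has no instance assigned. Restricting the loop to the window is what keeps the per-update cost independent of the total map size in this sketch.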

📝 Abstract

Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance-level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains both dense and instance-level open-vocabulary layers within a shared voxel map. This design enables further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross-layer fusion approach improves the quality of both the instance-level and dense layers, while also enabling a scalable and highly accurate instance-level map where the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.
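The multi-view fusion and open-vocabulary querying described in the abstract can be illustrated with a small sketch. It assumes a running mean as the fusion rule, a hash map keyed by integer voxel indices, and cosine similarity for querying; the class `VoxelEmbeddingMap` and its methods are hypothetical names, and the two-dimensional vectors stand in for embeddings that would in practice come from a vision-language model. None of this is taken from the FUS3DMaps implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

class VoxelEmbeddingMap:
    """Hypothetical dense layer: one averaged embedding per occupied voxel."""

    def __init__(self, voxel_size):
        self.voxel_size = voxel_size
        self.sums = {}    # voxel index -> summed embedding
        self.counts = {}  # voxel index -> number of fused observations

    def key(self, point):
        """Map a 3D point in metres to its integer voxel index."""
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def integrate(self, point, embedding):
        """Fuse one observation into its voxel (ASSUMPTION: running mean)."""
        k = self.key(point)
        prev = self.sums.get(k, [0.0] * len(embedding))
        self.sums[k] = [s + e for s, e in zip(prev, embedding)]
        self.counts[k] = self.counts.get(k, 0) + 1

    def embedding(self, k):
        """Mean embedding of all observations fused into voxel `k`."""
        return [s / self.counts[k] for s in self.sums[k]]

    def query(self, text_embedding):
        """Return the voxel whose embedding best matches the text embedding."""
        return max(self.sums, key=lambda k: cosine(self.embedding(k), text_embedding))

# Two views observe the same voxel, a third observes a different one.
vmap = VoxelEmbeddingMap(voxel_size=0.5)
vmap.integrate((0.1, 0.1, 0.1), [1.0, 0.0])
vmap.integrate((0.2, 0.3, 0.4), [0.8, 0.2])
vmap.integrate((1.2, 0.1, 0.1), [0.0, 1.0])
best = vmap.query([0.0, 1.0])  # stand-in for an encoded text prompt
```

Because the query is a similarity search against stored embeddings rather than a lookup in a fixed label set, any concept the text encoder can embed may be queried, which is the sense in which such a map is open-vocabulary.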

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic mapping

3D semantic mapping

scalability

instance-level mapping

dense semantic map

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-vocabulary semantic mapping

3D fusion

dual-layer mapping