FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses real-time open-vocabulary semantic understanding and navigation in unknown, large-scale environments—without ground-truth pose or prior maps. We propose the first object-centric, deformable voxel semantic mapping framework tailored for resource-constrained platforms (e.g., micro aerial vehicles). Methodologically, it integrates eSAM-based pixel-wise segmentation, CLIP-based vision-language embeddings, and SLAM-driven dynamic voxel submap construction with deformable update mechanisms, enabling end-to-end cross-modal alignment from natural language queries to 3D geometric structure. Key contributions include: (i) the first framework supporting zero-prior, language-driven 3D object-level environmental understanding and autonomous exploration; (ii) state-of-the-art semantic accuracy on the Replica closed-set benchmark; and (iii) successful real-time deployment on a micro aerial vehicle (MAV) platform.
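The segment-level feature aggregation described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes per-pixel vision-language features (e.g. CLIP-aligned) and boolean segment masks (e.g. from eSAM) as inputs, and mean-pools the features over each mask to obtain one embedding per segment.

```python
import numpy as np

def aggregate_segment_features(pixel_feats, seg_masks):
    """Mean-pool per-pixel vision-language features over each segment mask.

    pixel_feats: (H, W, D) array of per-pixel embeddings (e.g. CLIP-aligned).
    seg_masks:   list of (H, W) boolean masks (e.g. eSAM segments).
    Returns an (S, D) array: one L2-normalised embedding per segment.
    """
    embeddings = []
    for mask in seg_masks:
        feats = pixel_feats[mask]            # (N_pixels, D) features in this segment
        emb = feats.mean(axis=0)             # average over the segment
        emb = emb / (np.linalg.norm(emb) + 1e-8)  # normalise for cosine similarity
        embeddings.append(emb)
    return np.stack(embeddings)
```

Per-segment embeddings like these can then be attached to the object-centric submaps, which keeps memory usage proportional to the number of objects rather than the number of voxels.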

📝 Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable to facilitate robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything bridges the gap between pure geometric and open-vocabulary semantic information for a higher level of understanding, while allowing exploration of any environment without any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is also scalable in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.
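The submap deformation mentioned in the abstract can be illustrated with a small sketch. Assuming (as is common in submap-based SLAM, though the class and field names here are hypothetical) that each submap stores its geometry in a local frame anchored to a reference pose, a drift correction only needs to update the 4x4 anchor transform; the dense geometry follows along without being rebuilt:

```python
import numpy as np

class Submap:
    """Volumetric submap anchored to a reference pose (illustrative sketch).

    Geometry is stored in the submap's local frame, so when the SLAM
    back-end corrects drift, only the anchor pose T_WS (world-from-submap)
    is updated and the dense geometry deforms with it for free.
    """
    def __init__(self, T_WS, local_points):
        self.T_WS = T_WS                  # (4, 4) world-from-submap transform
        self.local_points = local_points  # (N, 3) points in the submap frame

    def world_points(self):
        # Express the stored local geometry in the world frame.
        R, t = self.T_WS[:3, :3], self.T_WS[:3, 3]
        return self.local_points @ R.T + t

    def on_pose_update(self, T_WS_corrected):
        # Called when the SLAM system corrects its estimate of the anchor.
        self.T_WS = T_WS_corrected
```

Keeping geometry submap-local is what makes the representation "deformable": consistency between submaps is restored simply by moving their anchors.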
Problem

Research questions and friction points this paper is trying to address.

Real-time open-vocabulary semantic understanding of large-scale unknown environments
Bridging geometric and open-vocabulary semantic information for robot navigation
Deploying vision-language mapping on resource-constrained devices like MAVs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-vocabulary semantic mapping with vision-language features
Object-centric volumetric submaps with SAM-generated segments
Real-time deployment on resource-constrained devices
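The language-driven exploration listed above reduces, at query time, to matching a text embedding against the stored per-object embeddings. A minimal sketch with cosine similarity follows; in the actual system both sides would come from a CLIP-style encoder, whereas here they are plain vectors for illustration:

```python
import numpy as np

def query_objects(text_emb, object_embs, top_k=1):
    """Rank mapped objects by cosine similarity to a text query embedding.

    text_emb:    (D,) embedding of the natural-language query.
    object_embs: (M, D) array, one embedding per object-centric submap.
    Returns the indices of the top_k best-matching objects.
    """
    q = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    o = object_embs / (np.linalg.norm(object_embs, axis=1, keepdims=True) + 1e-8)
    sims = o @ q                     # cosine similarity per object
    return np.argsort(-sims)[:top_k]
```

The top-ranked object's submap then provides the 3D geometry toward which the exploration planner can steer the robot.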