🤖 AI Summary
This work addresses real-time open-vocabulary semantic understanding and navigation in unknown, large-scale environments, without ground-truth pose or prior maps. We propose the first object-centric, deformable voxel semantic mapping framework tailored for resource-constrained platforms such as micro aerial vehicles (MAVs). Methodologically, it integrates efficient SAM (eSAM)-based pixel-wise segmentation, CLIP-based vision-language embeddings, and SLAM-driven dynamic voxel submap construction with deformable update mechanisms, enabling end-to-end cross-modal alignment from natural-language queries to 3D geometric structure. Key contributions include: (i) the first framework supporting zero-prior, language-driven 3D object-level environmental understanding and autonomous exploration; (ii) state-of-the-art semantic accuracy on the Replica closed-set benchmark; and (iii) successful real-time deployment on an MAV platform.
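As a rough illustration of the object-centric aggregation step described above, the sketch below averages pixel-wise vision-language features over each class-agnostic segment mask and fuses them into per-object embeddings with a running mean. It is a minimal sketch under assumed interfaces: the names `ObjectStore` and `aggregate_segment_features` are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

class ObjectStore:
    """Hypothetical per-object embedding store; fuses segment-level
    vision-language features with a running mean over observations."""

    def __init__(self, dim: int):
        self.dim = dim
        self.embeddings = {}  # object_id -> (unit embedding, observation count)

    def integrate(self, object_id: int, feature: np.ndarray) -> None:
        emb, n = self.embeddings.get(object_id, (np.zeros(self.dim), 0))
        emb = (emb * n + feature) / (n + 1)              # running mean
        self.embeddings[object_id] = (emb / (np.linalg.norm(emb) + 1e-8), n + 1)

def aggregate_segment_features(pixel_features: np.ndarray,
                               segment_masks: np.ndarray) -> list[np.ndarray]:
    """Average dense (H, W, D) vision-language features over each boolean
    (N, H, W) segment mask produced by a class-agnostic segmenter."""
    out = []
    for mask in segment_masks:
        feat = pixel_features[mask].mean(axis=0)         # mean feature inside the mask
        out.append(feat / (np.linalg.norm(feat) + 1e-8))
    return out
```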
📝 Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable to facilitate robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper, we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything bridges the gap between purely geometric and open-vocabulary semantic information for a higher level of understanding, while allowing exploration of any environment without any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that also remains scalable in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.
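To make the query-to-geometry mapping concrete, the following sketch scores per-object embeddings (fused as in the previous sketch) against a natural-language query via cosine similarity and returns the best match as an exploration target. This is an illustrative sketch only: `encode_text_query` is a hypothetical stand-in for a CLIP-style text encoder, and the ranking logic is a plausible reading of the abstract rather than the paper's exact method.

```python
import numpy as np

def encode_text_query(query: str, dim: int = 512) -> np.ndarray:
    """Hypothetical stand-in for a CLIP-style text encoder; a real system
    would embed the query into the same space as the object features."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def select_target_object(query: str,
                         object_embeddings: dict[int, np.ndarray]) -> tuple[int, float]:
    """Rank per-object embeddings (assumed unit-normalized, same dimension as
    the query embedding) by cosine similarity and return the best match."""
    q = encode_text_query(query)
    best_id, best_sim = -1, -np.inf
    for object_id, emb in object_embeddings.items():
        sim = float(np.dot(q, emb))  # cosine similarity for unit vectors
        if sim > best_sim:
            best_id, best_sim = object_id, sim
    return best_id, best_sim
```

In a full system, the selected object id would index into the corresponding object-centric volumetric submap, giving the 3D geometry that the exploration planner then targets.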