🤖 AI Summary
To address semantic map degradation in dynamic environments—caused by novel object emergence and viewpoint changes—this paper proposes an incrementally updatable implicit language mapping framework. Methodologically, it fuses vision-language multimodal features to construct a joint embedding space; employs implicit neural representations to model continuous spatial semantics; introduces an incremental decoder optimization mechanism for online adaptation to newly observed objects; and mitigates vision-language prediction inconsistency via multi-view feature alignment. Experimental results demonstrate significant improvements in semantic map generalization and cross-view consistency. The framework achieves enhanced robustness and semantic accuracy in natural language–driven navigation and manipulation tasks, outperforming prior methods in dynamic, real-world scenarios.
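To make the idea of an implicit language map concrete, the sketch below shows a minimal PyTorch decoder that maps 3D coordinates to a vision-language embedding and is trained against per-point features from a frozen vision-language encoder. The architecture, feature dimension, and cosine loss are illustrative assumptions, not the configuration used in the paper; `target_feats` is a placeholder for encoder outputs.

```python
# Illustrative sketch only: a minimal implicit language map in PyTorch.
# The architecture, feature dimensions, and loss are assumptions for
# exposition and are not taken from the LiLMaps paper.
import torch
import torch.nn as nn

class ImplicitLanguageMap(nn.Module):
    """MLP that maps 3D coordinates to a vision-language embedding."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),   # e.g. CLIP-sized embedding
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # Normalize predictions so they can be compared to unit-norm
        # vision-language embeddings with cosine similarity.
        feat = self.decoder(xyz)
        return feat / feat.norm(dim=-1, keepdim=True).clamp(min=1e-8)

# Training-step sketch: regress per-point features produced by a frozen
# vision-language encoder (here replaced by random placeholders).
model = ImplicitLanguageMap()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
xyz = torch.rand(1024, 3)                        # sampled 3D points
target_feats = torch.randn(1024, 512)            # placeholder VL features
target_feats = target_feats / target_feats.norm(dim=-1, keepdim=True)

pred = model(xyz)
loss = 1.0 - (pred * target_feats).sum(dim=-1).mean()  # cosine loss
optim.zero_grad()
loss.backward()
optim.step()
```

An incremental variant would repeat such update steps online as new observations arrive, optimizing the decoder so that newly appearing objects are represented without retraining the whole map from scratch.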
📝 Abstract
One of the current trends in robotics is to employ large language models (LLMs) to provide non-predefined command execution and natural human-robot interaction. It is useful to have an environment map together with its language representation, which can be further utilized by LLMs. Such a comprehensive scene representation enables numerous ways for autonomously operating robots to interact with the map. In this work, we present an approach that enhances incremental implicit mapping through the integration of vision-language features. Specifically, we (i) propose a decoder optimization technique for implicit language maps that can be used when new objects appear in the scene, and (ii) address the problem of inconsistent vision-language predictions between different viewing positions. Our experiments demonstrate the effectiveness of LiLMaps and show solid improvements in performance.
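Regarding contribution (ii), vision-language features predicted for the same surface point can differ depending on the viewpoint. One generic way to obtain a single, view-consistent target is confidence-weighted fusion of per-view features; the sketch below illustrates that idea only and is an assumption, not the specific alignment strategy proposed in the paper. The function `fuse_multiview_features` and the per-view confidences are hypothetical.

```python
# Illustrative sketch only: fusing per-view vision-language features for the
# same set of 3D points. Confidence-weighted averaging is one generic way to
# handle inconsistent predictions across views; it is an assumption, not the
# specific alignment strategy used in LiLMaps.
import torch

def fuse_multiview_features(view_feats: torch.Tensor,
                            view_conf: torch.Tensor) -> torch.Tensor:
    """Fuse features from V views into one target feature per point.

    view_feats: (V, N, D) unit-norm features predicted from each view.
    view_conf:  (V, N) per-view confidences (e.g., derived from viewing
                angle or segmentation score; both hypothetical here).
    """
    weights = view_conf / view_conf.sum(dim=0, keepdim=True).clamp(min=1e-8)
    fused = (weights.unsqueeze(-1) * view_feats).sum(dim=0)      # (N, D)
    return fused / fused.norm(dim=-1, keepdim=True).clamp(min=1e-8)

# Example: three views of 100 points with 512-dim features.
feats = torch.randn(3, 100, 512)
feats = feats / feats.norm(dim=-1, keepdim=True)
conf = torch.rand(3, 100)
targets = fuse_multiview_features(feats, conf)   # fusion targets for mapping
```

The fused features could then serve as supervision targets for the implicit map, so that the decoder is not pulled toward contradictory per-view predictions.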