🤖 AI Summary
To address low search efficiency for target objects in unknown environments, the difficulty of maintaining implicit memory for long-horizon planning, and the lack of fine-grained semantic information in explicit maps, this paper proposes a modular navigation framework. Its core contribution is the Frontier-Object Map, a novel online semantic map that jointly encodes spatial frontier structure and object-level semantics, tightly coupled with a vision-language model (VLM) to coordinate high-level goal reasoning with low-level path planning. The framework supports real-time multimodal scene understanding and incremental mapping, and the VLM is trained on a large-scale, automatically generated navigation dataset built from real-world scanned scenes. Evaluated on the MP3D and HM3D benchmarks, the approach achieves state-of-the-art SPL. Deployment on a physical robot further demonstrates robust real-world performance, significantly improving long-range navigation robustness and target recognition accuracy.
📝 Abstract
This paper addresses the Object Goal Navigation problem, in which a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which a low-level planner executes for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly on the navigation-efficiency metric SPL, and yields promising results on a real robot.