🤖 AI Summary
Existing zero-shot object navigation methods rely on static, internet-derived commonsense knowledge and cannot learn continually from embodied 3D experience. To address this limitation, we propose TrajRAG, a retrieval-augmented framework that draws on geometric-semantic navigation experiences. TrajRAG uses a topological-polar trajectory representation to compactly encode spatial layouts and semantic contexts, and organizes the resulting trajectories into a hierarchical chunk structure that supports coarse-to-fine experience retrieval. Retrieved experiences are passed to large language or vision-language models, which reason over them to select waypoints. Because new experiences are continually added, the framework supports lifelong learning and reuses historical navigation data. Experiments on MP3D, HM3D-v1, and HM3D-v2 show improved zero-shot navigation performance and indicate that TrajRAG retrieves experiences relevant to decision-making.
📝 Abstract
Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience. To address these limitations, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric-semantic experiences. TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we introduce a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, removing redundancies in raw episodic observations. A hierarchical chunking structure further groups similar topo-polar trajectories under unified summaries, enabling coarse-to-fine retrieval. During navigation, each candidate frontier generates trajectory hypotheses that query TrajRAG for similar past trajectories, and the retrieved trajectories guide large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience. Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric-semantic experiences and improves zero-shot ObjectNav performance.
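
To make the coarse-to-fine retrieval idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each topo-polar trajectory has already been reduced to a plain embedding vector, that similarity is cosine similarity, that a chunk summary is the running mean of its members, and that a fixed threshold decides when a new trajectory joins an existing chunk. The names `ExperienceStore`, `Chunk`, `merge_threshold`, `top_chunks`, and `top_k` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Chunk:
    # Assumption: the "unified summary" is modeled as the mean member embedding.
    summary: Vector
    members: List[Tuple[str, Vector]] = field(default_factory=list)


@dataclass
class ExperienceStore:
    # Hypothetical threshold above which a trajectory joins an existing chunk.
    merge_threshold: float = 0.8
    chunks: List[Chunk] = field(default_factory=list)

    def add(self, traj_id: str, emb: Vector) -> None:
        """Consolidate a new trajectory into the most similar chunk, or open a new one."""
        best = max(self.chunks, key=lambda c: cosine(c.summary, emb), default=None)
        if best is not None and cosine(best.summary, emb) >= self.merge_threshold:
            n = len(best.members)
            # Running mean keeps the summary in sync without re-reading all members.
            best.summary = [(s * n + e) / (n + 1) for s, e in zip(best.summary, emb)]
            best.members.append((traj_id, list(emb)))
        else:
            self.chunks.append(Chunk(summary=list(emb), members=[(traj_id, list(emb))]))

    def retrieve(self, query: Vector, top_chunks: int = 1, top_k: int = 2) -> List[Tuple[str, float]]:
        """Coarse step: rank chunk summaries. Fine step: rank members of the kept chunks."""
        coarse = sorted(self.chunks, key=lambda c: cosine(c.summary, query), reverse=True)[:top_chunks]
        fine = [(tid, cosine(emb, query)) for c in coarse for tid, emb in c.members]
        return sorted(fine, key=lambda hit: hit[1], reverse=True)[:top_k]
```

In this sketch, the trajectory hypothesis generated for a candidate frontier would play the role of `query`, and the returned trajectory identifiers with their scores would be the context handed to the large model. Comparing the query against chunk summaries first means only the members of the best-matching chunks are scored individually, which is the practical benefit of a coarse-to-fine scheme.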