Think before Go: Hierarchical Reasoning for Image-goal Navigation

📅 2026-04-19
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses sparse visual cues and aimless wandering in image-goal navigation, particularly when the target is distant or lies in another room. The authors propose HRNav, a hierarchical framework modeled on human cognition: a high-level module uses a vision-language model trained on a self-collected planning dataset to generate short-horizon plans (for example, walking through a door or down a hallway), while a low-level module selects actions with an online reinforcement learning policy conditioned on that plan. A Wandering Suppression Penalty (WSP) further discourages unproductive exploration. The reported experiments, in both simulated and real-world environments, show HRNav outperforming existing methods, with higher success rates on long-range navigation and less aimless roaming.
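
The two-level structure described above can be sketched as a control loop in which a planner is queried occasionally and a policy acts at every step. This is only an illustration under stated assumptions: the class `HierarchicalAgent`, the callables `planner`, `policy`, and `env_step`, the `"stop"` action, and the fixed `replan_every` interval are all hypothetical and are not taken from the paper, which may schedule replanning differently.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases; the source does not define an API.
Plan = str      # e.g. "walk through the door"
Action = str    # e.g. "forward", "turn_left", "stop"


@dataclass
class HierarchicalAgent:
    # High level: (observation, goal_image) -> short-horizon plan.
    planner: Callable[[object, object], Plan]
    # Low level: (observation, goal_image, plan) -> next action.
    policy: Callable[[object, object, Plan], Action]
    # Assumed replanning interval, in environment steps.
    replan_every: int = 10

    def run(self, env_step: Callable[[Action], object], first_obs: object,
            goal_image: object, max_steps: int = 50) -> List[Action]:
        """Roll out one episode and return the executed actions."""
        obs, plan, actions = first_obs, "", []
        for t in range(max_steps):
            # Query the slow, deliberate planner only every few steps.
            if t % self.replan_every == 0:
                plan = self.planner(obs, goal_image)
            # The fast policy acts at every step, conditioned on the plan.
            action = self.policy(obs, goal_image, plan)
            actions.append(action)
            if action == "stop":
                break
            obs = env_step(action)
        return actions
```

In this sketch the planner stands in for the vision-language model and the policy for the reinforcement learning executor; any pair of callables with these signatures can be plugged in for experimentation.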

📝 Abstract
Image-goal navigation steers an agent to a target location specified by an image in unseen environments. Existing methods primarily handle this task by learning an end-to-end navigation policy, which compares the target and observation images and directly predicts actions. However, when the target is distant or lies in another room, such methods fail to extract informative visual cues, leading the agent to wander. Motivated by the human cognitive principle that deliberate, high-level reasoning guides fast, reactive execution in complex tasks, we propose Hierarchical Reasoning Navigation (HRNav), a framework that decomposes image-goal navigation into high-level planning and low-level execution. In high-level planning, a vision-language model is trained on a self-collected dataset to generate a short-horizon plan, such as whether the agent should walk through the door or down the hallway. This reduces the difficulty of the long-horizon task, making it more tractable for the execution module. In low-level execution, an online reinforcement learning policy selects actions conditioned on the short-horizon plan. We also devise a novel Wandering Suppression Penalty (WSP) to further reduce wandering. Together, these components form a hierarchical framework for image-goal navigation. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method.
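
The abstract does not give the form of the Wandering Suppression Penalty. As a purely illustrative stand-in, and not the authors' formulation, one common way to discourage wandering in reinforcement learning is to subtract a reward term that grows each time the agent re-enters a region it has already visited. The function name `wandering_penalties` and the parameters `cell_size` and `weight` below are assumptions made for this sketch.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def wandering_penalties(positions: Iterable[Tuple[float, float]],
                        cell_size: float = 0.5,
                        weight: float = 0.1) -> List[float]:
    """Return one penalty per step, based on revisits to grid cells.

    Hypothetical shaping term: each position is snapped to a square
    grid cell, and the penalty at a step is -weight times the number
    of earlier visits to that same cell.
    """
    visits: Counter = Counter()
    penalties: List[float] = []
    for x, y in positions:
        # Discretize the continuous position into a grid cell index.
        cell = (int(x // cell_size), int(y // cell_size))
        # First visit costs nothing; each revisit costs more.
        penalties.append(-weight * visits[cell])
        visits[cell] += 1
    return penalties
```

A term like this would be added to the task reward during policy training, so that trajectories which circle back over the same area score lower than those that make progress.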
Problem

Research questions and friction points this paper is trying to address.

image-goal navigation
visual cues
wandering problem
long-horizon task
unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Reasoning
Image-goal Navigation
Vision-Language Model
Reinforcement Learning
Wandering Suppression
Pengna Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Kangyi Wu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Shaoqing Xu
University of Macau, BUAA, Xiaomi EV
3D Computer Vision, 3D Generation, Vision and Language Model, End2End, World Model
Fang Li
University of Macau, Xiaomi EV
Lin Zhao
Beijing Institute of Technology; JD Explore Academy
Embodied AI, Robot Learning
Long Chen
Xiaomi EV
Zhi-Xin Yang
University of Macau
Intelligent Fault Diagnosis & Maintenance, Robotics Vision and Control for Safety Monitoring
Nanning Zheng
Xi'an Jiaotong University