Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary object navigation faces two key challenges: low localization success for unseen objects and opaque decision-making. This paper proposes Nav-R², the first framework to explicitly model the dual relationships—target-environment perception and environment-action planning—via structured chain-of-thought reasoning and a parameter-free similarity-aware memory mechanism. Without increasing model parameters, Nav-R² achieves spatiotemporally consistent historical observation fusion and semantic alignment. By compressing video frames and modeling cross-modal similarity, it significantly enhances interpretability and cross-category generalization. Evaluated on standard benchmarks, Nav-R² achieves state-of-the-art performance, improving navigation success by 12.3% over prior methods, mitigating overfitting to seen categories, and maintaining real-time inference at 2 Hz.

📝 Abstract
Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making and low success rates in locating unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory (SA-Mem). We construct a Nav$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context, and finally make future action plans. Our SA-Mem preserves the features most relevant to the target and to the current observation, from both temporal and semantic perspectives, by compressing video frames and fusing historical observations while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2 Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.
Problem

Research questions and friction points this paper is trying to address.

Improves object-goal navigation for unseen objects in new environments
Addresses opaque decision-making and low success rates in open-vocabulary settings
Enhances generalization while avoiding overfitting to known object categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicitly models target-environment and environment-action relationships
Uses structured Chain-of-Thought reasoning with Similarity-Aware Memory
Compresses video frames and fuses historical observations without extra parameters
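The parameter-free, similarity-aware selection described above can be pictured with a short sketch. This is a hypothetical illustration under stated assumptions, not the paper's implementation: it assumes frame, target, and current-observation features are precomputed embeddings, and the function name and scoring rule are invented for exposition. Each historical frame is scored by cosine similarity to both the target embedding (semantic relevance) and the current observation (temporal relevance), and only the top-k frames are kept:

```python
import numpy as np

def select_memory_frames(frame_feats, target_feat, current_feat, k=4):
    """Hypothetical sketch of similarity-aware memory selection:
    score each historical frame by its cosine similarity to the target
    embedding and to the current observation, then keep the top-k
    frames restored to temporal order. No learned parameters."""
    def cosine(mat, vec):
        # cosine similarity between each row of `mat` and `vec`
        return mat @ vec / (np.linalg.norm(mat, axis=-1) * np.linalg.norm(vec) + 1e-8)

    score = cosine(frame_feats, target_feat) + cosine(frame_feats, current_feat)
    keep = np.sort(np.argsort(score)[::-1][:k])  # top-k indices, temporal order
    return frame_feats[keep]
```

Because the selection only ranks existing embeddings by similarity, it introduces no trainable parameters, which is consistent with the parameter-free claim above.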
Wentao Xiang
Tsinghua University
Haokang Zhang
Tsinghua University
Tianhang Yang
Tsinghua University
Zedong Chu
Amap, Alibaba Group
Ruihang Chu
Tsinghua University, CUHK, Wan
Generative AI, Vision-Language Model, Computer Vision
Shichao Xie
AutoNavi, Alibaba Group
Computer Vision, SLAM, VIO
Yujian Yuan
Amap, Alibaba Group
Jian Sun
Amap, Alibaba Group
Zhining Gu
Arizona State University
GIS, Deep Learning, Machine Learning
Junjie Wang
Tsinghua University
Xiaolong Wu
Georgia Institute of Technology
SLAM, Localization, Robotics
Mu Xu
Amap, Alibaba Group
Yujiu Yang
SIGS, Tsinghua University
Machine Learning, Natural Language Processing, Computer Vision