AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing image-based object navigation methods, which typically satisfy only coarse success criteria (e.g., stopping within 1 meter of the target) and thus fail to support downstream tasks requiring precise localization, such as robotic grasping. The authors propose a training-free system that treats the target image as a geometric query, leveraging semantic-guided exploration to locate relevant viewpoints and invoking a multi-view 3D foundation model to recover accurate 6-degree-of-freedom camera poses. By integrating a semantic-geometric cascaded mechanism with a self-verification loop for pose refinement, the method substantially improves navigation precision. Evaluated on Gibson and HM3D benchmarks, it achieves success rates of 93.1% and 82.6%, respectively, with pose errors as low as 0.27 m / 3.41° and 0.21 m / 1.23°, representing a 5–10× improvement over baseline approaches.
📝 Abstract
Image Goal Navigation (ImageNav) is evaluated by a coarse success criterion: the agent must stop within 1 m of the target. This is sufficient for finding objects but falls short for downstream tasks, such as grasping, that require precise positioning. We introduce AnyImageNav, a training-free system that pushes ImageNav toward this more demanding setting. Our key insight is that the goal image can be treated as a geometric query: any photo of an object, a hallway, or a room corner can be registered to the agent's observations via dense pixel-level correspondences, enabling recovery of the exact 6-DoF camera pose. Our method realizes this through a semantic-to-geometric cascade: a semantic relevance signal guides exploration and acts as a proximity gate, invoking a 3D multi-view foundation model only when the current view is highly relevant to the goal image; the model then self-certifies its registration in a loop to yield an accurately recovered pose. Our method sets state-of-the-art navigation success rates on Gibson (93.1%) and HM3D (82.6%), and achieves pose recovery that prior methods do not provide: a position error of 0.27 m and a heading error of 3.41 degrees on Gibson, and 0.21 m / 1.23 degrees on HM3D, a 5-10x improvement over adapted baselines.
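The semantic-to-geometric cascade described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the feature vectors, the `RELEVANCE_GATE` and verification thresholds, and the `register_views` stand-in for the 3D multi-view foundation model are all hypothetical placeholders.

```python
import math

RELEVANCE_GATE = 0.8   # assumed gating threshold; the paper's value is not given
MAX_REFINE_STEPS = 5   # assumed cap on the self-verification loop

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def register_views(goal_feat, obs_feat, step):
    """Stand-in for the 3D multi-view foundation model.

    Returns a (pose, confidence) pair; here confidence simply grows
    with each refinement step to mimic iterative improvement.
    """
    pose = (0.2, 0.1, 3.0)  # dummy (x, y, heading_deg) values
    confidence = min(1.0, 0.5 + 0.2 * step)
    return pose, confidence

def last_meter_localize(goal_feat, obs_feat, verify_thresh=0.9):
    """Semantic-to-geometric cascade with a self-verification loop."""
    # Semantic proximity gate: invoke the expensive geometric model
    # only when the current view is highly relevant to the goal image.
    if cosine(goal_feat, obs_feat) < RELEVANCE_GATE:
        return None  # keep exploring instead
    # Self-verification loop: re-register until the model certifies
    # its own pose estimate above the verification threshold.
    for step in range(MAX_REFINE_STEPS):
        pose, conf = register_views(goal_feat, obs_feat, step)
        if conf >= verify_thresh:
            return pose
    return None
```

For example, `last_meter_localize([1.0, 0.0], [0.9, 0.1])` passes the gate (high cosine similarity) and refines until the mock model self-certifies, while a semantically unrelated view returns `None` and exploration continues.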
Problem

Research questions and friction points this paper is trying to address.

Image Goal Navigation
precise positioning
6-DoF pose estimation
last-meter navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-Goal Navigation
6-DoF Pose Estimation
Multi-view Foundation Model
Semantic-to-Geometric Cascade
Training-Free Navigation
Yijie Deng
NYUAD Center for Artificial Intelligence and Robotics (CAIR), Abu Dhabi, UAE; New York University Abu Dhabi, Electrical Engineering, Abu Dhabi 129188, UAE; New York University, Electrical & Computer Engineering Dept., Brooklyn, NY 11201, USA; Embodied AI and Robotics (AIR) Lab, NYU Abu Dhabi, UAE
Shuaihang Yuan
NYUAD Center for Artificial Intelligence and Robotics (CAIR), Abu Dhabi, UAE; New York University Abu Dhabi, Electrical Engineering, Abu Dhabi 129188, UAE; Embodied AI and Robotics (AIR) Lab, NYU Abu Dhabi, UAE
Yi Fang
Associate Professor at NYU Abu Dhabi and NYU Tandon
3D Computer Vision; 3D Deep Learning; 3D Meta Learning