Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of existing vision-language navigation models: they struggle to locate unseen objects efficiently because of weak spatiotemporal reasoning, and attempts to inject more complex reasoning incur prohibitive computational costs. To overcome both, we propose Hydra-Nav, a unified architecture inspired by dual-process theory in cognitive science, which adaptively switches between a "slow system" for high-level planning and analysis of exploration history and a "fast system" for efficient action execution. Our approach integrates large vision-language models with spatial-action alignment, memory-reasoning fusion, and iterative rejection fine-tuning, trained via a three-stage curriculum strategy. We also introduce a new metric, SOT (Success weighted by Operation Time), to evaluate search efficiency under varying reasoning intensities. Hydra-Nav achieves significant performance gains, improving success rates by 11.1%, 17.4%, and 21.2% on the HM3D, MP3D, and OVON benchmarks, respectively, outperforming current state-of-the-art methods.

📝 Abstract
While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects; these failures are primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative slow system for analyzing exploration history and formulating high-level plans, and a reactive fast system for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.
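The adaptive switching described above (a deliberative slow system invoked only at critical decision points, a reactive fast system otherwise) can be illustrated with a toy control loop. This is a hedged sketch under assumed interfaces: the class name, the uncertainty-based trigger, and the placeholder planner are all hypothetical, since the page does not reproduce the paper's actual switching criterion or API.

```python
# Illustrative sketch of adaptive dual-process navigation.
# All names and the uncertainty-threshold trigger are assumptions for
# illustration; the paper's real switching mechanism is not given here.

from dataclasses import dataclass, field


@dataclass
class DualProcessAgentSketch:
    """Toy agent that calls a slow planner only at 'critical' steps."""

    uncertainty_threshold: float = 0.5
    plan: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def step(self, observation, uncertainty: float) -> str:
        """Return the next action, replanning only when needed."""
        self.history.append(observation)
        if uncertainty > self.uncertainty_threshold or not self.plan:
            # Slow system: deliberate over exploration history and
            # produce a high-level plan (a VLM reasoning step in the paper;
            # here just a fixed placeholder).
            self.plan = self._slow_plan()
        # Fast system: reactively execute the next planned action.
        return self.plan.pop(0)

    def _slow_plan(self) -> list:
        # Placeholder planner standing in for deliberative reasoning.
        return ["move_forward", "turn_left", "move_forward"]
```

A high-uncertainty step triggers replanning, while subsequent low-uncertainty steps consume the cached plan without invoking the slow system, which is the source of the efficiency gain the abstract claims over fixed-frequency reasoning.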
Problem

Research questions and friction points this paper is trying to address.

- object goal navigation
- temporal-spatial reasoning
- vision-language models
- navigation efficiency
- computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

- adaptive dual-process reasoning
- object goal navigation
- vision-language models
- selective reasoning
- search efficiency
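On search efficiency: the paper names SOT (Success weighted by Operation Time) but this page does not reproduce its formula. A plausible reading, by analogy with the standard SPL navigation metric, weights each success by the ratio of a reference operation time to the actual operation time; the sketch below assumes that form and is not the paper's verbatim definition.

```python
def sot(successes, op_times, best_times):
    """Assumed SPL-analogous form of SOT (not the paper's exact formula).

    successes[i]  -- 1 if episode i succeeded, else 0
    op_times[i]   -- operation (reasoning/compute) time the agent spent
    best_times[i] -- reference minimal operation time for that episode
    """
    n = len(successes)
    return sum(
        s * (b / max(t, b))  # penalize time spent beyond the reference
        for s, t, b in zip(successes, op_times, best_times)
    ) / n
```

Under this reading, `sot([1, 0, 1], [2.0, 5.0, 1.0], [1.0, 1.0, 1.0])` gives 0.5: the first episode succeeds at half efficiency, the second fails, and the third succeeds at the reference time. An agent that reasons heavily on every step would see its score shrink even at equal success rate, which matches the abstract's claim that adaptive reasoning beats fixed-frequency baselines on this metric.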
👥 Authors
Zixuan Wang (ByteDance Seed; Institute of Automation, Chinese Academy of Sciences)
Huang Fang (ByteDance Seed)
Shaoan Wang (Peking University)
Yuanfei Luo (ByteDance Seed)
Heng Dong (ByteDance Seed)
Wei Li (ByteDance)
Yiming Gan (Institute of Computing Technology, Chinese Academy of Sciences)