🤖 AI Summary
This work addresses the limitations of existing vision-language navigation models, which struggle to locate unseen objects efficiently due to insufficient spatiotemporal reasoning, while more complex reasoning mechanisms often incur prohibitive computational costs. To overcome this, we propose Hydra-Nav, a unified architecture inspired by the dual-process theory in cognitive science, which adaptively switches between a "slow system" for high-level planning and historical analysis and a "fast system" for efficient action execution. Our approach integrates large vision-language models with spatial-action alignment, memory-reasoning fusion, and iterative rejection fine-tuning, trained via a three-stage curriculum strategy. We introduce a new metric, SOT (Success weighted by Operation Time), to evaluate search efficiency under varying reasoning intensities. Hydra-Nav achieves significant performance gains, improving success rates by 11.1%, 17.4%, and 21.2% on the HM3D, MP3D, and OVON benchmarks, respectively, outperforming current state-of-the-art methods.
📝 Abstract
While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects, failures primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative slow system for analyzing exploration history and formulating high-level plans, and a reactive fast system for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.
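The slow/fast switching described above can be illustrated with a minimal control loop. This is a hypothetical sketch, not the paper's implementation: the class name, the confidence-based trigger, and the placeholder plan are all assumptions introduced for illustration; the paper learns when to invoke the slow system via rejection fine-tuning rather than a fixed threshold.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of dual-process dispatch for navigation.
# All names and the confidence trigger are assumptions, not Hydra-Nav's API.

@dataclass
class DualProcessAgent:
    replan_threshold: float = 0.5          # below this, invoke the slow system
    plan: List[str] = field(default_factory=list)
    slow_calls: int = 0
    fast_calls: int = 0

    def slow_plan(self, observation: str, history: List[str]) -> List[str]:
        """Deliberative system: analyze exploration history, emit a high-level plan."""
        self.slow_calls += 1
        # Placeholder plan standing in for VLM-generated reasoning output.
        return ["turn_left", "move_forward", "move_forward"]

    def fast_act(self) -> str:
        """Reactive system: execute the next planned low-level action."""
        self.fast_calls += 1
        return self.plan.pop(0)

    def step(self, observation: str, confidence: float, history: List[str]) -> str:
        # Deliberate only at uncertain decision points or when the plan is
        # exhausted; otherwise act reactively at low cost.
        if confidence < self.replan_threshold or not self.plan:
            self.plan = self.slow_plan(observation, history)
        return self.fast_act()
```

Under this scheme, the expensive slow system runs only at a fraction of the steps, which is the behavior the SOT metric is designed to reward relative to fixed-frequency reasoning.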