🤖 AI Summary
This work addresses three obstacles to deploying vision-language models (VLMs) in real-time drone navigation: a frequency mismatch between VLM inference and real-time planning, limited 3D scene understanding, and the lack of a mechanism to balance semantic guidance against motion efficiency. To overcome these limitations, we propose AirHunt, a system with an asynchronous dual-pathway architecture that integrates VLM-based semantic reasoning with continuous path planning. AirHunt further introduces an active dual-task reasoning module and a semantic-geometric coherent planning module that dynamically reconcile semantic priorities with motion efficiency while adapting to heterogeneous environments. Experiments show that AirHunt achieves a higher success rate, lower navigation error, and shorter flight time than state-of-the-art methods across diverse outdoor open-set object navigation tasks. Real-world experiments further support its practicality in complex environments.
📝 Abstract
Recent advances in large Vision-Language Models (VLMs) have provided rich semantic understanding that empowers drones to search for open-set objects via natural language instructions. However, prior systems struggle to integrate VLMs into practical aerial systems due to an orders-of-magnitude frequency mismatch between VLM inference and real-time planning, as well as VLMs' limited 3D scene understanding. They also lack a unified mechanism to balance semantic guidance with motion efficiency in large-scale environments. To address these challenges, we present AirHunt, an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments by seamlessly fusing VLM semantic reasoning with continuous path planning. AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM reasoning and path planning, enabling continuous flight with adaptive semantic guidance that evolves through motion. Moreover, we propose an active dual-task reasoning module that exploits geometric and semantic redundancy to enable selective VLM querying, and a semantic-geometric coherent planning module that dynamically reconciles semantic priorities and motion efficiency in a unified framework, enabling seamless adaptation to environmental heterogeneity. We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time compared to state-of-the-art methods. Real-world experiments further validate AirHunt's practical capability in complex and challenging environments. Code and dataset will be made publicly available before publication.
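The dual-pathway idea described above can be illustrated with a small toy simulation. This is not AirHunt's implementation: every name here (`Candidate`, `pick_goal`, `simulate`), the fixed VLM latency, and the utility formula (semantic score divided by one plus travel distance) are assumptions made only for illustration. The sketch shows a fast planning loop that replans every tick using whatever semantic scores are currently available, while a slow semantic result arrives later without ever blocking flight.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical search region the drone could fly to."""
    name: str
    position: tuple
    semantic_score: float = 0.5  # neutral prior until the slow path reports


def pick_goal(drone_pos, candidates, weight=1.0):
    # Assumed utility: semantic score discounted by travel distance.
    def utility(c):
        return c.semantic_score / (1.0 + weight * math.dist(drone_pos, c.position))
    return max(candidates, key=utility)


def step_toward(pos, goal, step=1.0):
    # Move at most `step` units along the straight line to the goal.
    d = math.dist(pos, goal)
    if d <= step:
        return goal
    return (pos[0] + step * (goal[0] - pos[0]) / d,
            pos[1] + step * (goal[1] - pos[1]) / d)


def simulate(candidates, vlm_scores, vlm_latency=3, ticks=30):
    """Fast path replans every tick; the slow-path answer lands at `vlm_latency`."""
    pos = (0.0, 0.0)
    goals = []
    for t in range(ticks):
        if t == vlm_latency:
            # Slow path finished: refresh semantic scores. The planner kept
            # flying on the prior scores in the meantime.
            for c in candidates:
                c.semantic_score = vlm_scores[c.name]
        goal = pick_goal(pos, candidates)
        goals.append(goal.name)
        pos = step_toward(pos, goal.position)
    return goals, pos


# Usage: a nearby region looks attractive at first, but the delayed semantic
# scores favor the distant one, so the planner switches goals mid-flight.
regions = [Candidate("near", (5.0, 0.0)), Candidate("far", (0.0, 20.0))]
goals, final_pos = simulate(regions, {"near": 0.05, "far": 0.95})
print(goals[:5], final_pos)
```

In this toy setup the planner never waits for the semantic result; it commits to the geometrically cheap goal first and redirects once better semantic evidence arrives, which is the behavior the abstract attributes to continuous flight with evolving semantic guidance.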