APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

2026-01-31
Citations: 0
Influential: 0
AI Summary
This work addresses the challenge of goal-oriented navigation for unmanned aerial vehicles (UAVs) in complex aerial environments, where the absence of effective spatial memory, interpretable decision-making, and efficient exploration mechanisms hinders performance. To overcome these limitations, the authors propose a hierarchical asynchronous parallel navigation agent that integrates a dynamic 3D semantic map, a reinforcement learning-driven control policy, and an open-vocabulary target recognition module. Leveraging a zero-shot vision-language model, the approach constructs an interpretable and decoupled semantic memory representation, while an asynchronous architecture mitigates inference latency, thereby enhancing exploratory initiative and generalization capability. Evaluated on the UAV-ON benchmark, the method achieves a 4.2% improvement in Success Rate (SR) and a 2.8% gain in Success weighted by Path Length (SPL), significantly outperforming existing approaches.
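For readers unfamiliar with the two reported metrics, the sketch below computes SR and SPL using the definition that is standard in embodied navigation (per-episode success weighted by the ratio of shortest-path length to the length actually flown). It is an illustration only: the function and argument names are hypothetical, and the exact evaluation protocol of UAV-ON may differ in details not given on this page.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, in its standard form.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length to the goal and p is the path length the agent
    actually travelled. Failed episodes contribute 0.
    """
    total = 0.0
    for ok, l, p in zip(successes, shortest_lengths, path_lengths):
        denom = max(p, l)
        if ok and denom > 0:
            total += l / denom
    return total / len(successes)


# Hypothetical episodes: an inefficient success, a failure, an optimal success.
episode_success = [True, False, True]
episode_shortest = [10.0, 5.0, 8.0]
episode_travelled = [20.0, 7.0, 8.0]

sr_value = success_rate(episode_success)
spl_value = spl(episode_success, episode_shortest, episode_travelled)
```

Because every successful episode contributes at most 1, SPL can never exceed SR; the gap between the two shows how much longer the flown paths were than the shortest ones.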

Abstract
Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and a language description. However, existing methods struggle with memorizing complex spatial representations in aerial environments, making reliable and interpretable action decisions, and exploring and gathering information efficiently. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) a Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps that serve as an interpretable memory mechanism; 2) an Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy; and 3) a Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. These components are integrated into a hierarchical, asynchronous, and parallel framework that effectively bypasses the VLM's inference latency and boosts the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on the challenging UAV-ON benchmark, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design. Our source code is available on GitHub: https://github.com/4amGodvzx/apex
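The asynchronous idea in the abstract, a slow VLM-backed mapper running in parallel with a fast controller that never waits for it, can be illustrated with a small `asyncio` sketch. This is not the authors' implementation: `SharedMemory`, `slow_mapper`, `fast_controller`, and `run_episode` are hypothetical names, and `asyncio.sleep` stands in for both VLM inference and the control period.

```python
import asyncio
from typing import Dict, List, Tuple

Cell = Tuple[int, int, int]


class SharedMemory:
    """Toy stand-in for the decoupled map memory (hypothetical structure)."""

    def __init__(self) -> None:
        self.version = 0                        # number of completed map updates
        self.attraction: Dict[Cell, float] = {}  # voxel -> attraction score


async def slow_mapper(memory: SharedMemory, n_updates: int, latency: float) -> None:
    """Writes a map update after each simulated (slow) VLM inference."""
    for i in range(n_updates):
        await asyncio.sleep(latency)            # placeholder for VLM latency
        memory.attraction[(i, 0, 0)] = 1.0
        memory.version += 1


async def fast_controller(memory: SharedMemory, n_steps: int, period: float) -> List[int]:
    """Acts every control period on whatever map version is currently available."""
    seen_versions = []
    for _ in range(n_steps):
        seen_versions.append(memory.version)    # read latest map, never block on it
        await asyncio.sleep(period)             # placeholder for executing an action
    return seen_versions


async def run_episode(
    n_updates: int = 2, latency: float = 0.05, n_steps: int = 20, period: float = 0.01
) -> Tuple[List[int], SharedMemory]:
    memory = SharedMemory()
    mapper = asyncio.create_task(slow_mapper(memory, n_updates, latency))
    seen = await fast_controller(memory, n_steps, period)
    await mapper                                # let any pending map update finish
    return seen, memory
```

Calling `asyncio.run(run_episode())` returns the map version the controller saw at each step. The controller takes many steps per map update, so mapper latency affects how fresh the map is rather than how often the agent acts.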
Problem

Research questions and friction points this paper is trying to address.

Aerial Object Goal Navigation
spatial memory
action decision-making
exploration efficiency
Embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aerial Object Goal Navigation
Vision-Language Model
Hierarchical Asynchronous Architecture
Dynamic Spatio-Semantic Mapping
Open-Vocabulary Detection
Daoxuan Zhang
Harbin Institute of Technology, Shenzhen
Ping Chen
Harbin Institute of Technology, Shenzhen
Xiaobo Xia
Postdoc, National University of Singapore
Data-Centric AI, Trustworthy AI, Machine Learning, Multimodal Learning, AI4Science
Xiu Su
Central South University
Ruichen Zhen
Meituan Academy of Robotics Shenzhen, Meituan
Jianqiang Xiao
Harbin Institute of Technology, Shenzhen
Shuo Yang
Professor, Harbin Institute of Technology (Shenzhen)
Data-Centric AI, Trustworthy AI, Machine Learning, Computer Vision