FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses zero-shot vision-language navigation for UAVs in complex 3D environments under ambiguous, multi-step instructions, focusing on the weaknesses of existing methods in long-horizon planning and generalization to unseen scenes. Inspired by human cognition, the authors propose FineCog-Nav, a modular framework that decomposes navigation into fine-grained cognitive modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model and coordinated through role-specific prompts and structured input-output protocols, which the authors report enables effective collaboration and improves interpretability. A new benchmark, AerialVLN-Fine, comprising 300 curated trajectories derived from AerialVLN, supports evaluation with sentence-level instruction-trajectory alignment. Experiments show consistent gains over zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments.

📝 Abstract

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.
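
The idea of role-specific prompts combined with structured input-output protocols can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Message` and `Module` types, the prompt wording, the strictly sequential module order, and the toy `echo_model` that stands in for a foundation-model call. The actual FineCog-Nav data flow and prompts may differ substantially.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a foundation-model call: prompt text in, text out.
ModelFn = Callable[[str], str]


@dataclass
class Message:
    """Structured record exchanged between modules (fields are illustrative)."""
    sender: str
    content: str


@dataclass
class Module:
    """One cognitive module: a name, a role-specific prompt, and a model."""
    name: str
    role_prompt: str
    model: ModelFn

    def run(self, inbox: List[Message]) -> Message:
        # Serialize all prior messages into a fixed, tagged text format.
        context = "\n".join(f"[{m.sender}] {m.content}" for m in inbox)
        prompt = f"{self.role_prompt}\n{context}"
        return Message(sender=self.name, content=self.model(prompt))


def echo_model(prompt: str) -> str:
    """Toy model for demonstration only: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


# Module names follow the list in the abstract; the sequential order is assumed.
MODULE_ORDER = [
    "language", "perception", "attention", "memory",
    "imagination", "reasoning", "decision",
]


def step(instruction: str, observation: str,
         model: ModelFn = echo_model) -> List[Message]:
    """Run one navigation step by passing a shared message log through every module."""
    log = [Message("instruction", instruction),
           Message("observation", observation)]
    for name in MODULE_ORDER:
        module = Module(name, f"You are the {name} module of a UAV navigator.", model)
        log.append(module.run(log))
    return log


if __name__ == "__main__":
    for msg in step("Fly to the red roof.", "A red roof is ahead."):
        print(f"{msg.sender}: {msg.content}")
```

Keeping every exchange in a tagged message log is one simple way to make each module's contribution inspectable, which is the kind of interpretability benefit the abstract attributes to structured protocols. Replacing `echo_model` with a real model client would be the natural next step.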

Problem

Research questions and friction points this paper is trying to address.

UAV vision-language navigation

zero-shot learning

multimodal navigation

long-horizon planning

instruction following

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained cognitive modules

zero-shot navigation

multimodal UAV