VLA-AN: An Efficient and Onboard Vision-Language-Action Framework for Aerial Navigation in Complex Environments

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address four key challenges in autonomous UAV navigation within complex environments—domain shift, weak temporal reasoning, low safety of generated actions, and difficulty in onboard deployment—this paper proposes the first edge-deployable vision-language-action (VLA) closed-loop navigation framework. Methodologically, it introduces a synthetic dataset built upon 3D Gaussian splatting and a progressive three-stage supervised training paradigm; further, it designs a lightweight real-time action decoder coupled with a geometrically constrained safety correction mechanism to ensure the physical feasibility of generated policies. Evaluated on resource-constrained onboard platforms, the framework achieves an 8.3× improvement in end-to-end inference throughput and a single-task success rate of up to 98.1%. It significantly enhances spatial referring comprehension, multi-step scene reasoning, and long-horizon navigation robustness.

📝 Abstract
This paper proposes VLA-AN, an efficient and onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal reasoning in navigation, safety issues with generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset utilizing 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction. This module ensures fast, collision-free, and stable command generation, mitigating the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust, real-time 8.3× improvement in inference throughput on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1% and providing an efficient, practical solution for realizing full-chain closed-loop autonomy in lightweight aerial robots.
Problem

Research questions and friction points this paper is trying to address.

Bridging domain gaps between simulation and real-world drone navigation data
Ensuring safe and stable autonomous flight with real-time collision avoidance
Enabling efficient onboard deployment for resource-constrained aerial robots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Gaussian Splatting to bridge domain gap
Implements progressive three-stage training for navigation
Deploys lightweight action module with safety correction
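The paper does not publish the safety-correction algorithm itself, but the idea of a geometric correction applied to a stochastically generated action can be sketched as follows. Everything here is a hypothetical illustration: the function name, the half-space constraint, and the parameters (`safety_margin`, `max_speed`) are assumptions, not the authors' implementation.

```python
import numpy as np

def geometric_safety_correction(action, obstacle_dir, obstacle_dist,
                                safety_margin=1.0, max_speed=2.0):
    """Make a generated velocity command physically feasible (sketch).

    action        -- (3,) velocity command from the generative policy, m/s
    obstacle_dir  -- (3,) unit vector from the drone toward the nearest obstacle
    obstacle_dist -- distance to that obstacle, m
    """
    v = np.asarray(action, dtype=float)
    # Clamp speed to the platform's physical limit.
    speed = np.linalg.norm(v)
    if speed > max_speed:
        v = v * (max_speed / speed)
    # Inside the safety margin, project out any velocity component that
    # points toward the obstacle (a simple half-space constraint).
    if obstacle_dist < safety_margin:
        toward = float(v @ np.asarray(obstacle_dir, dtype=float))
        if toward > 0.0:
            v = v - toward * np.asarray(obstacle_dir, dtype=float)
    return v
```

A filter of this shape is cheap enough to run at control rate on an embedded platform, which matches the paper's emphasis on real-time, collision-free command generation.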
👥 Authors

Yuze Wu
Zhejiang University
Control & Planning, Robot Learning, Embodied Intelligence

Mo Zhu
Zhejiang University, Differential Robotics

Xingxing Li
GFZ
GPS, GNSS precise positioning and orbit determination, GNSS data processing, GNSS seismology, GNSS meteorology

Yuheng Du
Differential Robotics

Yuxin Fan
University of Pennsylvania
Machine Learning, AI, Finance

Wenjun Li
Zhejiang University, Differential Robotics

Xin Zhou
Differential Robotics

Fei Gao
Zhejiang University, Differential Robotics