GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current autonomous drones struggle to interpret and execute natural-language instructions in real time within unstructured environments, primarily because they rely on handcrafted skills, labor-intensive hyperparameter tuning, or computationally heavy models that hinder onboard deployment. This paper introduces the first fully onboard, real-time, lightweight Vision-Language-Action (VLA) framework. It integrates 3D Gaussian Splatting (3DGS)-based simulation, differentiable reinforcement learning (DiffRL), and a Mixture-of-Experts (MoE) action head to jointly improve generalization and continual-learning capability. In simulation, task success rates reach 83% (seen tasks) and 75% (unseen); on physical drones, 67% and 50%, respectively. Cross-environment average success rates are 81% in simulation and 67% on hardware, demonstrating substantial gains in the practicality and robustness of language-guided navigation.

📝 Abstract
Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework that runs fully onboard and follows natural-language commands in real time. Our policy is trained in a photorealistic 3D Gaussian Splatting (3DGS) simulator via Differentiable Reinforcement Learning (DiffRL), enabling efficient learning of low-level control from visual and linguistic inputs. At its core is a Mixture-of-Experts (MoE) action head, which adaptively routes computation to improve generalization while mitigating forgetting. In multi-task generalization experiments, GRaD-Nav++ achieves a success rate of 83% on trained tasks and 75% on unseen tasks in simulation. When deployed on real hardware, it attains 67% success on trained tasks and 50% on unseen ones. In multi-environment adaptation experiments, GRaD-Nav++ achieves an average success rate of 81% across diverse simulated environments and 67% across varied real-world settings. These results establish a new benchmark for fully onboard Vision-Language-Action (VLA) flight and demonstrate that compact, efficient models can enable reliable, language-guided navigation without relying on external infrastructure.
Problem

Research questions and friction points this paper is trying to address.

Enabling drones to follow natural-language commands in real time
Overcoming dependence on hand-crafted skills and parameter tuning
Achieving reliable onboard vision-language-action navigation without external infrastructure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully onboard, lightweight Vision-Language-Action framework
3D Gaussian Splatting simulator for training
Mixture-of-Experts action head for generalization
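To make the Mixture-of-Experts action head concrete: a small gating network scores a set of expert sub-networks for each input, and the final action is the gate-weighted blend of the experts' outputs, so different inputs route computation to different experts. The sketch below is a minimal toy illustration of this routing idea, not the paper's actual architecture; all layer sizes, names, and the use of plain linear experts are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoEActionHead:
    """Toy MoE action head: a gating network routes a fused
    vision-language feature to several linear experts and blends
    their action outputs by the gate weights. Dimensions are
    illustrative, not taken from GRaD-Nav++."""

    def __init__(self, feat_dim, action_dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.W_gate = rng.normal(0.0, 0.1, (feat_dim, num_experts))
        self.W_experts = rng.normal(0.0, 0.1, (num_experts, feat_dim, action_dim))

    def __call__(self, feat):
        gate = softmax(feat @ self.W_gate)                 # (B, E) routing weights
        per_expert = np.einsum("bf,efa->bea", feat, self.W_experts)  # (B, E, A)
        return np.einsum("be,bea->ba", gate, per_expert)   # gate-weighted blend

head = MoEActionHead(feat_dim=16, action_dim=4, num_experts=3)
actions = head(np.ones((2, 16)))  # batch of 2 fused features -> 2 action vectors
print(actions.shape)  # (2, 4)
```

Because the gate is a softmax, the blend is a convex combination of expert outputs; in continual-learning settings this soft routing is what lets new tasks lean on some experts while leaving others largely untouched, mitigating forgetting.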
Qianzhong Chen
Department of Mechanical Engineering, Stanford University, Stanford, CA 94305, USA
Naixiang Gao
Department of Mechanical Engineering, Stanford University, Stanford, CA 94305, USA
Suning Huang
Aeronautics and Astronautics Department, Stanford University, Stanford, CA 94305, USA
JunEn Low
Department of Mechanical Engineering, Stanford University, Stanford, CA 94305, USA
Timothy Chen
Stanford University (Robotics, Perception, Control)
Jiankai Sun
Aeronautics and Astronautics Department, Stanford University, Stanford, CA 94305, USA
Mac Schwager
Stanford University (Robotics, Control, Multi-Agent Systems, Machine Learning, Statistical Inference and Estimation)