🤖 AI Summary
This work addresses the challenge of end-to-end autonomous flight for quadrotor racing drones in complex visual environments. We propose a purely vision-driven, model-based reinforcement learning framework. To our knowledge, this is the first successful deployment of DreamerV3 on a real-world drone platform operating solely from raw monocular onboard camera pixels, without state estimation, intermediate representations, imitation learning, or perception-guided reward engineering. Our approach integrates world-model learning, latent-space planning, and end-to-end visuomotor policy optimization, and achieves high-speed, stable gate traversal on both simulated and real-world race tracks. Quantitative evaluation demonstrates substantial performance gains over model-free baselines such as PPO. The results validate the framework’s effectiveness, robustness, and engineering feasibility on physical robotic systems.
📝 Abstract
Autonomous drone racing has emerged as a challenging robotics benchmark for testing the limits of learning, perception, planning, and control. Expert human pilots can fly a drone agilely through a race track by mapping the real-time feed from a single onboard camera directly to control commands. Recent works in autonomous drone racing that attempt direct pixel-to-command control policies (without explicit state estimation) have either relied on intermediate representations that simplify the observation space or performed extensive bootstrapping via Imitation Learning (IL). This paper introduces an approach that learns policies from scratch, allowing a quadrotor to autonomously navigate a race track by directly mapping raw onboard camera pixels to control commands, just as human pilots do. By leveraging model-based reinforcement learning (RL), specifically DreamerV3, we train visuomotor policies capable of agile flight through a race track using only raw pixel observations. While model-free RL methods such as PPO struggle to learn under these conditions, DreamerV3 efficiently acquires complex visuomotor behaviors. Moreover, because our policies learn directly from pixel inputs, the perception-aware reward term used in previous RL approaches to guide training is no longer needed. Our experiments demonstrate, in both simulation and real-world flight, that the proposed approach can be deployed on agile quadrotors. This approach advances the frontier of vision-based autonomous flight and shows that model-based RL is a promising direction for real-world robotics.