🤖 AI Summary
In dense, cluttered environments, conventional model-based navigation fails for high-speed quadrotors (7 m/s) due to sensor noise, error accumulation, and processing latency. To address this, we propose a vision transformer (ViT)-based end-to-end obstacle-avoidance framework that directly maps depth images to low-level control commands. This work introduces the ViT to end-to-end visual quadrotor navigation for the first time and integrates a recurrent mechanism to strengthen temporal modeling and generalization. We pretrain in high-fidelity simulation and then transfer to the real world. Experiments demonstrate stable obstacle avoidance at 7 m/s in both simulation and physical deployment, outperforming CNN, U-Net, and recurrent baselines in success rate, robustness, and real-time performance. The method also generalizes better to unseen environments, consumes less energy, and overcomes key limitations of modular approaches to high-speed navigation.
📝 Abstract
We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules break down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have shown great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth-image-to-control prediction in high-fidelity simulation, observing that ViT models are more effective than the others as quadrotor speed increases and generalize better to unseen environments, and that adding recurrence further improves performance while reducing quadrotor energy cost across all tested flight speeds. We assess performance at speeds of up to 7 m/s in simulation and hardware. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.
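To make the described pipeline concrete, a minimal NumPy sketch of one control step is shown below: a depth image is split into patches and embedded, a single self-attention layer pools a scene feature (the ViT component), a recurrent state carries temporal context, and a linear head regresses a low-level command. All dimensions, the single-head attention, the GRU-like state update, and the 4-element command layout (thrust plus body rates) are illustrative assumptions, not the paper's actual architecture or trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative toy dimensions (assumptions, not from the paper):
PATCH, D = 8, 32              # 8x8 patches, 32-dim embeddings
H, W = 32, 32                 # toy depth-image size -> 16 patches
N = (H // PATCH) * (W // PATCH)

# Random weights stand in for trained parameters.
W_embed = rng.normal(0, 0.02, (PATCH * PATCH, D))
W_q, W_k, W_v = (rng.normal(0, 0.02, (D, D)) for _ in range(3))
W_h = rng.normal(0, 0.02, (2 * D, D))    # recurrent state update
W_out = rng.normal(0, 0.02, (D, 4))      # -> [thrust, p, q, r]

def step(depth_img, h_state):
    """One control step: depth image + recurrent state -> command."""
    # 1. Split the depth image into non-overlapping patches and embed them.
    patches = depth_img.reshape(H // PATCH, PATCH, W // PATCH, PATCH)
    patches = patches.transpose(0, 2, 1, 3).reshape(N, PATCH * PATCH)
    tokens = patches @ W_embed                       # (N, D)
    # 2. Single-head self-attention over patch tokens (ViT-style).
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    attn = softmax(q @ k.T / np.sqrt(D)) @ v         # (N, D)
    feat = attn.mean(axis=0)                         # pooled scene feature
    # 3. Recurrent update fuses the current feature with the past state.
    h_state = np.tanh(np.concatenate([feat, h_state]) @ W_h)
    # 4. Regress the low-level command from the recurrent state.
    return h_state @ W_out, h_state

h = np.zeros(D)
for _ in range(3):                                   # a short rollout
    cmd, h = step(rng.random((H, W)), h)
print(cmd.shape)  # (4,)
```

A trained version of such a network would replace the random matrices with parameters learned in simulation, with the recurrent state giving the controller the temporal context that the paper credits for improved robustness and lower energy cost.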