Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

📅 2024-05-16
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
🤖 AI Summary
In dense, cluttered environments, conventional model-based navigation fails for high-speed quadrotors (7 m/s) due to sensor noise, error accumulation, and processing latency. To address this, we propose a vision transformer (ViT)-based end-to-end obstacle avoidance framework that directly maps depth images to low-level control commands. To the authors' knowledge, this is the first work to apply ViTs to end-to-end vision-based quadrotor control; it also integrates a recurrent mechanism to enhance temporal modeling and generalization. We adopt high-fidelity simulation pretraining followed by real-world transfer. Experiments demonstrate stable 7 m/s obstacle avoidance in both simulation and physical deployment, outperforming CNN, U-Net, and recurrent baselines in success rate, robustness, and real-time performance. The method exhibits superior environmental generalization, reduced energy consumption, and enhanced robustness, overcoming key limitations of modular approaches in high-speed navigation.

📝 Abstract
We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules break down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have shown great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity simulation, observing that ViT models are more effective than others as quadrotor speeds increase and in generalization to unseen environments, while the addition of recurrence further improves performance and reduces quadrotor energy cost across all tested flight speeds. We assess performance at speeds of up to 7 m/s in simulation and hardware. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.
Problem

Research questions and friction points this paper is trying to address.

End-to-end vision-based obstacle avoidance for high-speed quadrotors
Overcoming limitations of traditional model-based navigation approaches
Comparing learning architectures for depth image-to-control in cluttered environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformers for end-to-end quadrotor control
Attention-based approach for obstacle avoidance
Combines ViT with recurrence for energy efficiency
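The bullets above describe a ViT-style attention encoder over depth-image patches combined with a recurrent state for temporal context, mapping each depth frame to low-level commands. A minimal illustrative NumPy sketch of that idea follows; this is not the authors' code, and all layer sizes, the random weights, and the 4-dimensional command output (e.g. thrust plus body rates) are assumptions:

```python
import numpy as np

# Toy sketch of a depth-image -> control pipeline: patch embedding,
# single-head self-attention, a simple recurrent state update, and a
# linear control head. All parameters are random placeholders; a real
# model would learn them end-to-end.

rng = np.random.default_rng(0)
D = 16          # token embedding dimension (assumed)
N_CTRL = 4      # assumed command size, e.g. thrust + 3 body rates

def patchify(depth, patch=8):
    """Split an HxW depth image into flattened non-overlapping patches."""
    h, w = depth.shape
    p = depth.reshape(h // patch, patch, w // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative random parameters.
W_embed = rng.normal(size=(64, D)) * 0.1       # patch -> token
W_q, W_k, W_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
W_h = rng.normal(size=(2 * D, D)) * 0.1        # recurrent update
W_out = rng.normal(size=(D, N_CTRL)) * 0.1     # control head

def step(depth, h_state):
    """One control step: depth image + previous state -> command, new state."""
    tokens = patchify(depth) @ W_embed                    # (n_patches, D)
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    attn = softmax(q @ k.T / np.sqrt(D)) @ v              # self-attention
    pooled = attn.mean(axis=0)                            # (D,) token pooling
    h_state = np.tanh(np.concatenate([pooled, h_state]) @ W_h)  # recurrence
    return h_state @ W_out, h_state

h = np.zeros(D)
cmd, h = step(rng.random((32, 32)), h)
print(cmd.shape)  # (4,)
```

The recurrent state carried across `step` calls is what gives the controller temporal context between successive depth frames, which the summary credits for the improved robustness and energy cost.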
Anish Bhattacharya
The General Robotics, Automation, Sensing & Perception (GRASP) Lab, University of Pennsylvania
Nishanth Rao
The General Robotics, Automation, Sensing & Perception (GRASP) Lab, University of Pennsylvania
Dhruv Parikh
The General Robotics, Automation, Sensing & Perception (GRASP) Lab, University of Pennsylvania
Pratik Kunapuli
Graduate Student, University of Pennsylvania
Robotics · Machine Learning · Artificial Intelligence
Nikolai Matni
Associate Professor of Electrical and Systems Engineering, University of Pennsylvania
Control Theory · Machine Learning · Optimization
Vijay Kumar
The General Robotics, Automation, Sensing & Perception (GRASP) Lab, University of Pennsylvania