From Ground to Air: Noise Robustness in Vision Transformers and CNNs for Event-Based Vehicle Classification with Potential UAV Applications

📅 2025-06-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses vehicle classification from event-camera data in dynamic scenarios—such as drone-based vision—where sparse, asynchronous event streams pose unique challenges. We systematically compare the performance and noise robustness of a convolutional neural network (ResNet34) and a vision transformer (ViT-B16), fine-tuning both on the GEN1 dataset. To assess robustness, we introduce event-level noise perturbations—including timestamp jitter, polarity flipping, and event dropping—during evaluation. Results show that while ResNet34 achieves marginally higher accuracy on clean data (88% vs. 86%), ViT-B16 demonstrates markedly superior stability and generalization across all noise conditions. Notably, ViT-B16 maintains strong noise resilience even under limited pretraining data. These findings reveal an intrinsic advantage of ViT architectures in modeling sparse, temporally irregular event data, suggesting their suitability for resource-constrained, high-dynamics aerial perception tasks—providing both theoretical insight and empirical validation for transformer-based event-stream processing.
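The three event-level perturbations named above (timestamp jitter, polarity flipping, event dropping) can be sketched as simple operations on an event stream. The snippet below is an illustrative reconstruction, not the paper's implementation: the field names (`t`, `x`, `y`, `p`) and the noise parameters are assumptions chosen to match common event-camera conventions (timestamps in microseconds, polarity in {0, 1}).

```python
import numpy as np

def perturb_events(events, jitter_std_us=1000.0, flip_prob=0.1,
                   drop_prob=0.1, rng=None):
    """Apply event-level noise to a structured array of events.

    `events` is assumed to be a NumPy structured array with fields
    't' (timestamp, microseconds), 'x', 'y', and 'p' (polarity in {0, 1}).
    Field names and default parameters are illustrative only.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    out = events.copy()
    # Timestamp jitter: add Gaussian noise, then re-sort so the
    # stream stays temporally ordered.
    out['t'] = out['t'] + rng.normal(0.0, jitter_std_us, size=len(out))
    out = np.sort(out, order='t')
    # Polarity flipping: invert polarity for a random subset of events.
    flip = rng.random(len(out)) < flip_prob
    out['p'][flip] = 1 - out['p'][flip]
    # Event dropping: discard a random fraction of events.
    keep = rng.random(len(out)) >= drop_prob
    return out[keep]
```

Applying such perturbations only at evaluation time, as the study does, probes how each architecture degrades when the sparse event stream it was trained on becomes noisier than the training distribution.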

📝 Abstract
This study investigates the performance of the two most relevant computer vision deep learning architectures, the Convolutional Neural Network and the Vision Transformer, for event-based cameras. These cameras capture scene changes, unlike traditional frame-based cameras, which capture static images, and are particularly suited for dynamic environments such as UAVs and autonomous vehicles. The deep learning models studied in this work are ResNet34 and ViT-B16, fine-tuned on the GEN1 event-based dataset. The research evaluates and compares these models under both standard conditions and in the presence of simulated noise. Initial evaluations on the clean GEN1 dataset reveal that ResNet34 and ViT-B16 achieve accuracies of 88% and 86%, respectively, with ResNet34 showing a slight advantage in classification accuracy. However, the ViT-B16 model demonstrates notable robustness, particularly given its pre-training on a smaller dataset. Although this study focuses on ground-based vehicle classification, the methodologies and findings hold significant promise for adaptation to UAV contexts, including aerial object classification and event-based vision systems for aviation-related tasks.
Problem

Research questions and friction points this paper is trying to address.

Compare CNN and ViT for event-based vehicle classification
Evaluate model robustness under simulated noise conditions
Explore UAV applications for event-based vision systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Event-based cameras for dynamic environments
ResNet34 and ViT-B16 for vehicle classification
Robustness evaluation under simulated noise conditions
Nouf Almesafri
Cranfield University, MK43 0AL Cranfield, UK; Technology Innovation Institute, Masdar City, 9639 Abu Dhabi, UAE
Hector Figueiredo
Qinetiq, MK43 7TA Bedford, UK
Miguel Arana-Catania
Senior Research Software Engineer AI/NLP, University of Oxford