🤖 AI Summary
To address the poor generalization, high latency, and weak out-of-distribution robustness of RGB-based models in real-time UAV obstacle avoidance, this paper proposes an end-to-end, low-latency perception-decision system that leverages event cameras and FPGA hardware acceleration. Methodologically, it integrates dynamic vision sensor (DVS) event streams, a lightweight spiking neural network, spatiotemporal aggregation encoding, and an online inference pipeline, achieving 2.14 ms end-to-end latency on a Xilinx Zynq platform. The paper introduces the first event-stream classification paradigm explicitly designed for motion-state discrimination to mitigate overfitting. Experiments demonstrate an equivalent frame rate of 1 kHz and reductions of 20 ms and 20 mm in temporal and spatial prediction error, respectively, relative to the RGB baseline. Moving/stationary classification precision reaches 78% (a 59 percentage point advantage over the RGB baseline's 19%), and the F1-score improves to 0.73 (vs. 0.06), significantly enhancing action prediction accuracy, robustness, and generalization.
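The spatiotemporal aggregation step mentioned above can be illustrated with a minimal sketch: DVS events arriving within a fixed time window (the abstract reports a 1 ms aggregation stage) are binned into a two-channel count frame by pixel and polarity. The event tuple layout, sensor resolution, and function name below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def aggregate_events(events, width=128, height=128, window_us=1000, t_start=0):
    """Accumulate DVS events from one time window into a 2-channel count frame.

    events: array-like of (x, y, t_us, polarity) rows, polarity 0 (OFF) or 1 (ON).
    Returns a (2, height, width) frame: channel 0 = OFF counts, channel 1 = ON counts.
    NOTE: resolution and event layout are assumptions for illustration only.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    for x, y, t, p in events:
        # Keep only events inside the aggregation window (e.g. 1 ms = 1000 us).
        if t_start <= t < t_start + window_us:
            frame[int(p), int(y), int(x)] += 1.0
    return frame

# Two ON events at pixel (3, 5) fall inside the window; the third event
# (t = 1500 us) lies outside the 1 ms window and is dropped.
events = np.array([[3, 5, 100, 1], [3, 5, 200, 1], [10, 2, 1500, 0]])
frame = aggregate_events(events)
```

A frame like this can then be fed to the downstream network in place of an RGB image, which is what makes the 1 kHz equivalent frame rate possible.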
📝 Abstract
This work quantitatively evaluates event-based vision systems (EVS) against conventional RGB-based models for action prediction in collision avoidance on an FPGA accelerator. Our experiments demonstrate that the EVS model achieves a significantly higher effective frame rate (1 kHz) and lower temporal (−20 ms) and spatial (−20 mm) prediction errors than the RGB-based model, particularly when tested on out-of-distribution data. The EVS model also exhibits superior robustness in selecting optimal evasion maneuvers. In distinguishing between moving and stationary states, it achieves a 59 percentage point advantage in precision (78% vs. 19%) and a substantially higher F1-score (0.73 vs. 0.06), highlighting the RGB model's susceptibility to overfitting. Further analysis across different combinations of spatial classes confirms the EVS model's consistent performance on both test datasets. Finally, an end-to-end evaluation of the system yields a latency of approximately 2.14 ms, with event aggregation (1 ms) and inference on the processing unit (0.94 ms) accounting for the largest components. These results underscore the advantages of event-based vision for real-time collision avoidance and demonstrate its potential for deployment in resource-constrained environments.
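The reported F1 gap follows directly from the harmonic-mean definition of F1. The abstract gives the two precisions (78% and 19%) but not the recalls; the recall values below are back-solved purely to illustrate how the reported 0.73 vs. 0.06 figures can arise, and are assumptions, not results from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precisions are from the abstract; recalls are NOT reported -- these are
# back-solved illustrative values that reproduce the reported F1 scores.
evs_f1 = f1_score(0.78, 0.69)   # close to the reported 0.73
rgb_f1 = f1_score(0.19, 0.036)  # close to the reported 0.06
```

Because the harmonic mean is dominated by the smaller operand, a model whose recall collapses on out-of-distribution data (as the RGB baseline's apparently does) sees its F1 fall far below its precision alone would suggest.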