RGB-Event Fusion with Self-Attention for Collision Prediction

📅 2025-05-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the need for accurate time-to-collision and impact-location prediction in real-time UAV obstacle avoidance under dynamic environments, this paper proposes an RGB-event dual-modal fusion framework. Cross-modal self-attention is introduced into a dual-encoder architecture for temporal alignment and feature integration. The work characterizes the event camera's advantage in millisecond-level temporal modeling and quantitatively evaluates the accuracy-efficiency trade-off across 1- to 8-bit model quantization. On the ABCD benchmark, the fused model operates at 50 Hz and achieves an average 1% absolute improvement in prediction accuracy over single-modal baselines, reaching up to 10% for distant obstacles (>0.5 m). Relative to the RGB-only model, the event-only model reduces time error by 26% and position error by 4% at a similar computational cost. Finally, the paper demonstrates the feasibility of low-bitwidth deployment without significant performance degradation.

📝 Abstract
Ensuring robust and real-time obstacle avoidance is critical for the safe operation of autonomous robots in dynamic, real-world environments. This paper proposes a neural network framework for predicting the time and collision position of an unmanned aerial vehicle with a dynamic object, using RGB and event-based vision sensors. The proposed architecture consists of two separate encoder branches, one for each modality, followed by fusion via self-attention to improve prediction accuracy. To facilitate benchmarking, we leverage the ABCD [8] dataset, which enables detailed comparisons of single-modality and fusion-based approaches. At the same prediction throughput of 50 Hz, the experimental results show that the fusion-based model improves prediction accuracy over single-modality approaches by 1% on average and by 10% for distances beyond 0.5 m, but comes at the cost of +71% memory and +105% FLOPs. Notably, the event-based model outperforms the RGB model by 4% in position error and 26% in time error at a similar computational cost, making it a competitive alternative. Additionally, we evaluate quantized versions of the event-based models, applying 1- to 8-bit quantization to assess the trade-offs between predictive performance and computational efficiency. These findings highlight the trade-offs of multi-modal perception with RGB and event-based cameras in robotic applications.
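The fusion step described in the abstract can be sketched as cross-modal scaled dot-product attention, where tokens from one encoder branch (here RGB) attend to tokens from the other (events). This is a minimal illustration only: the identity projections, token counts, and feature dimension are placeholder assumptions, not the paper's actual learned architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(rgb_feats, event_feats):
    """Fuse two modality token sequences with scaled dot-product
    attention: RGB tokens (queries) attend to event tokens (keys/values).
    rgb_feats: (N_rgb, d), event_feats: (N_evt, d)."""
    d_k = rgb_feats.shape[-1]
    # Identity mappings stand in for learned W_q, W_k, W_v projections.
    q, k, v = rgb_feats, event_feats, event_feats
    scores = q @ k.T / np.sqrt(d_k)   # (N_rgb, N_evt) similarity scores
    weights = softmax(scores, axis=-1)  # each RGB token's attention over events
    return weights @ v                # (N_rgb, d) event-informed RGB features

rgb = np.random.randn(4, 8)   # 4 RGB tokens, 8-dim features (illustrative)
evt = np.random.randn(6, 8)   # 6 event tokens
fused = cross_modal_attention(rgb, evt)
print(fused.shape)  # (4, 8)
```

In a full dual-encoder model this block would typically run in both directions (events attending to RGB as well) before the joint features are decoded into time and position predictions.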
Problem

Research questions and friction points this paper is trying to address.

Predicts UAV collision time and position using RGB-event fusion
Improves accuracy over single-modality methods but increases computational cost
Evaluates quantization trade-offs in event-based models for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

RGB-Event fusion using self-attention mechanism
Dual encoder branches for separate modalities
Quantized event-based models for efficiency
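The 1- to 8-bit quantization evaluated in the paper can be illustrated with a symmetric uniform quantize-dequantize pass over a weight vector; the 1-bit case falls back to sign binarization. The function below is a generic sketch of the technique, not the paper's specific quantization scheme.

```python
def quantize_dequantize(weights, bits):
    """Symmetric uniform quantization of `weights` to `bits` bits,
    then dequantization back to floats to expose the rounding error.
    For bits == 1, binarize to {-m, +m} (sign times max magnitude)."""
    m = max(abs(w) for w in weights)
    if m == 0:
        return list(weights)
    if bits == 1:
        return [m if w >= 0 else -m for w in weights]
    levels = 2 ** (bits - 1) - 1        # e.g. 127 integer levels for 8-bit
    scale = m / levels                  # step size between quantized values
    return [round(w / scale) * scale for w in weights]

w = [0.9, -0.45, 0.12, -0.03]
for b in (8, 4, 2, 1):
    print(b, quantize_dequantize(w, b))
```

Lower bit widths shrink memory and arithmetic cost at the price of a larger worst-case rounding error (bounded by half the step size), which is the accuracy-efficiency trade-off the paper measures on the event-based models.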