🤖 AI Summary
Existing Event-RGB fusion detection datasets offer sparse coverage of challenging scenarios (e.g., low-light, overexposure, high-speed motion) and low spatial resolution (≤640×480), hindering fair evaluation of multimodal detectors under adverse conditions. To address this, we introduce PEOD, the first large-scale, pixel-level spatiotemporally aligned, high-resolution Event-RGB object detection benchmark, comprising 130+ sequences and 340K fine-grained manual bounding-box annotations, 57% of which were captured under extreme conditions. The benchmark supports event-only, RGB-only, and fused multimodal inputs, establishing a high-fidelity evaluation standard for multimodal detection. Extensive experiments reveal that state-of-the-art fusion methods degrade significantly when illumination is poor, whereas event-only models remain markedly more robust, indicating that current fusion strategies adapt poorly when one modality is severely degraded.
📝 Abstract
Robust object detection in challenging scenarios increasingly relies on event cameras, yet existing Event-RGB datasets remain constrained by sparse coverage of extreme conditions and low spatial resolution (≤640×480), which prevents comprehensive evaluation of detectors under such scenarios. To address these limitations, we propose PEOD, the first large-scale, pixel-aligned and high-resolution (1280×720) Event-RGB dataset for object detection under challenging conditions. PEOD contains 130+ spatiotemporally aligned sequences and 340K manual bounding boxes, with 57% of the data captured under low-light, overexposure, or high-speed motion. Furthermore, we benchmark 14 methods across three input configurations (event-based, RGB-based, and Event-RGB fusion) on PEOD. On the full test set and the normal subset, fusion-based models achieve the best performance. On the illumination-challenge subset, however, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, indicating the limits of existing fusion methods when the frame modality is severely degraded. PEOD thus establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research.
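For concreteness, below is a minimal sketch of how a pixel-aligned PEOD sample and the three benchmark input configurations might be consumed in code. The class name `PEODSample`, its field layout, the `condition` tags, and the voxel-grid event representation are illustrative assumptions, not the dataset's actual API; they only mirror the properties stated above (1280×720 resolution, aligned event/RGB pairs, box annotations, three input configurations).

```python
# Hypothetical sketch of a PEOD sample and the three input configurations.
# All names and layouts here are assumptions for illustration, not PEOD's API.
from dataclasses import dataclass
from typing import Tuple
import numpy as np


@dataclass
class PEODSample:
    """One spatiotemporally aligned Event-RGB pair at 1280x720 (assumed layout)."""
    rgb: np.ndarray        # (720, 1280, 3) uint8 frame
    events: np.ndarray     # (N, 4) events as (t_us, x, y, polarity)
    boxes: np.ndarray      # (M, 4) boxes as (x1, y1, x2, y2) in pixels
    labels: np.ndarray     # (M,) class ids
    condition: str = "normal"  # e.g. "low_light", "overexposure", "high_speed"


def events_to_voxel(events: np.ndarray, bins: int = 5,
                    hw: Tuple[int, int] = (720, 1280)) -> np.ndarray:
    """Rasterize an event stream into a (bins, H, W) voxel grid -- one common
    event representation; the dataset itself may ship raw streams instead."""
    voxel = np.zeros((bins, *hw), dtype=np.float32)
    if len(events) == 0:
        return voxel
    t = events[:, 0]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1)  # normalize timestamps
    b = np.clip((t_norm * bins).astype(int), 0, bins - 1)
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    pol = np.where(events[:, 3] > 0, 1.0, -1.0)
    np.add.at(voxel, (b, y, x), pol)  # accumulate signed event counts
    return voxel


def make_input(sample: PEODSample, config: str):
    """Assemble a detector input for each of the three benchmark configurations."""
    if config == "event":
        return events_to_voxel(sample.events)
    if config == "rgb":
        return sample.rgb
    if config == "fusion":
        return (sample.rgb, events_to_voxel(sample.events))
    raise ValueError(f"unknown config: {config}")


if __name__ == "__main__":
    # Toy sample with random events, just to exercise the three configurations.
    rng = np.random.default_rng(0)
    ev = np.stack([rng.integers(0, 10_000, 500),   # timestamps (us)
                   rng.integers(0, 1280, 500),     # x
                   rng.integers(0, 720, 500),      # y
                   rng.integers(0, 2, 500)],       # polarity
                  axis=1).astype(np.float64)
    sample = PEODSample(rgb=np.zeros((720, 1280, 3), np.uint8), events=ev,
                        boxes=np.empty((0, 4)), labels=np.empty((0,), int),
                        condition="low_light")
    for cfg in ("event", "rgb", "fusion"):
        print(cfg, type(make_input(sample, cfg)))
```

Under this assumed layout, slicing the test set by `condition` is what yields the normal and illumination-challenge subsets on which the fusion-based and event-based results above are reported.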