🤖 AI Summary
To address inter-modal latency discrepancies (low-latency event streams vs. high-latency RGB frames) and training-inference temporal misalignment (sparse event-based labels vs. continuous RGB video) in event-RGB fusion detection, this paper proposes the Frequency-Adaptive Low-Latency Object Detector (FAOD). Its core contributions are: (1) an Align module for fine-grained spatiotemporal alignment between event streams and RGB frames; and (2) a Time Shift training paradigm that treats the event stream as the high-frequency primary reference and RGB as auxiliary, enabling robust detection under an extreme 80× modal frequency disparity for the first time. On PKU-DAVIS-SOD, FAOD achieves a 9.8-point mAP gain over prior work with only a quarter of SODFormer's parameters; under 80× frequency disparity, mAP drops by merely 3 points. It also sets a new state-of-the-art on DSEC-Detection.
📝 Abstract
Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches, namely low-latency Events vs. high-latency RGB frames, and temporally sparse labels in training vs. continuous streams in inference, significantly hinder high-frequency fusion-based object detection. To address these challenges, we propose the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency Events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB Mismatch. We further propose a training strategy, Time Shift, which enforces the network to align the predictions from temporally shifted Event-RGB pairs with those from their original counterparts, i.e., to stay consistent with Event-aligned annotations. This strategy enables the network to use high-frequency Event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the Event stream for high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs generalize better from a low training frequency to higher inference frequencies than Event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that FAOD achieves SOTA performance. Specifically, on the PKU-DAVIS-SOD dataset, FAOD achieves a 9.8-point mAP improvement on fully paired Event-RGB data with only a quarter of the parameters of SODFormer, and maintains robust performance (only a 3-point mAP drop) under an 80× Event-RGB frequency mismatch.
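The Time Shift idea, pairing each high-frequency event window with a deliberately stale (temporally shifted) low-frequency RGB frame while keeping the labels aligned to the event timestamp, can be sketched as a data-pairing step. This is an illustrative sketch only; `pair_event_rgb`, its signature, and the timestamp layout are assumptions, not the paper's actual implementation:

```python
import random

def pair_event_rgb(event_timestamps, rgb_timestamps, max_shift=0):
    """Pair each event window with the latest RGB frame at or before it,
    optionally shifted further into the past (Time Shift augmentation).

    Assumes the first RGB timestamp is no later than the first event
    timestamp. Labels (not shown) would remain aligned to the event
    timestamp, so the network learns to treat RGB as auxiliary context.
    """
    pairs = []
    for t_ev in event_timestamps:
        # Index of the latest RGB frame that is not newer than this event window.
        idx = max(i for i, t in enumerate(rgb_timestamps) if t <= t_ev)
        # Time Shift: randomly fall back to an older RGB frame during training.
        shift = random.randint(0, max_shift)
        idx = max(0, idx - shift)
        pairs.append((t_ev, rgb_timestamps[idx]))
    return pairs

# Events every 20 ms, RGB every 80 ms: most event windows must rely on a
# stale RGB frame, mimicking the Event-RGB frequency mismatch.
pairs = pair_event_rgb([0, 20, 40, 60, 80], [0, 80], max_shift=0)
# → [(0, 0), (20, 0), (40, 0), (60, 0), (80, 80)]
```

At inference time no shift is applied (`max_shift=0`); the augmentation only runs during training so the detector does not overfit to perfectly synchronized pairs.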