Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inter-modal latency discrepancies (low-latency event streams vs. high-latency RGB frames) and training-inference temporal misalignment (sparse event-based labels vs. continuous RGB video) in event-RGB fusion detection, this paper proposes the Frequency-Adaptive Object Detector (FAOD). Its core contributions are: (1) an Align module for fine-grained spatiotemporal alignment between event streams and RGB frames; and (2) a Time Shift training paradigm that treats the event stream as the high-frequency primary reference and RGB as auxiliary—enabling robust detection under extreme 80× modal frequency disparity for the first time. On PKU-DAVIS-SOD, FAOD achieves a 9.8% mAP gain over prior work with only 25% of SODFormer’s parameters; under 80× frequency disparity, mAP degrades by merely 3%. It also sets a new state-of-the-art on DSEC-Detection.

📝 Abstract
Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches significantly hinder high-frequency fusion-based object detection: low-latency Events vs. high-latency RGB frames, and temporally sparse labels in training vs. continuous flow in inference. To address these challenges, we propose the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency Events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB Mismatch. We further propose a training strategy, Time Shift, which enforces the module to align the prediction from temporally shifted Event-RGB pairs with that from their original representation, keeping both consistent with Event-aligned annotations. This strategy enables the network to use high-frequency Event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the Event stream toward high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs generalize better from low training frequencies to higher inference frequencies than Event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that FAOD achieves SOTA performance. Specifically, on the PKU-DAVIS-SOD dataset, FAOD achieves a 9.8-point mAP improvement on fully paired Event-RGB data with only a quarter of the parameters of SODFormer, and maintains robust performance (only a 3-point drop in mAP) under an 80× Event-RGB frequency mismatch.
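The abstract's core idea is that each high-frequency event window is paired with the most recent (possibly stale) low-frequency RGB frame, while labels stay aligned to the event timestamp. The pairing logic can be sketched as follows; this is a minimal illustration of the frequency-mismatch setup, not the authors' implementation, and `pair_events_with_rgb` is a hypothetical helper:

```python
from bisect import bisect_right

def pair_events_with_rgb(event_times, rgb_times, max_shift=0.0):
    """For each event-window timestamp t, pick the latest RGB frame
    captured at or before (t - max_shift). The RGB input may lag,
    but supervision stays aligned to the event timestamp t."""
    pairs = []
    for t in event_times:
        # Index of the last RGB frame not newer than the shifted time.
        i = bisect_right(rgb_times, t - max_shift) - 1
        pairs.append((t, rgb_times[i] if i >= 0 else None))
    return pairs

# Illustrative 80x mismatch: event windows at 200 Hz, RGB at 2.5 Hz.
event_times = [i * 0.005 for i in range(8)]   # 0.000 .. 0.035 s
rgb_times = [0.0, 0.4]                        # one frame every 0.4 s
pairs = pair_events_with_rgb(event_times, rgb_times)
```

Every event window in the example reuses the stale frame at t = 0.0, which is exactly the situation the Align Module and Time Shift training are designed to make the detector robust to.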
Problem

Research questions and friction points this paper is trying to address.

Fusing Events and RGB images for robust object detection
Low-latency Events vs. high-latency RGB frames
Temporally sparse labels in training vs. continuous flow in inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses Events and RGB images for object detection
Aligns low-frequency RGB frames with high-frequency Events via an Align Module
Uses a Time Shift training strategy to keep predictions consistent with Event-aligned annotations
Haitian Zhang
Wuhan University, Wuhan, China
Xiangyuan Wang
Wuhan University
Neuromorphic Vision, Image Processing, Pattern Recognition
Chang Xu
EPFL, Lausanne, Switzerland
Xinya Wang
National Institutes of Health
Fang Xu
Wuhan University
Image Processing
Huai Yu
Wuhan University
Robotics, Robot Vision, SLAM
Lei Yu
Wuhan University, Wuhan, China
Wen Yang
Wuhan University, Wuhan, China