🤖 AI Summary
This study systematically quantifies the sim-to-real gap in CARLA’s Dynamic Vision Sensor (DVS) module—specifically its fidelity in event camera modeling—and its impact on traffic object detection performance.
Method: We propose an evaluation paradigm in which a Recurrent Vision Transformer (RVT) is trained exclusively on synthetic event data generated by CARLA's DVS and then evaluated for cross-domain generalization on real-world event streams and on mixtures of synthetic and real data.
Contribution/Results: We present the first quantitative evidence of significant simulation distortion in CARLA’s DVS: models trained solely on synthetic events suffer over 40% mAP degradation on real event data, whereas models trained on real data exhibit strong cross-domain robustness. These findings identify insufficient DVS simulation fidelity as the primary bottleneck limiting event-based perception performance. Consequently, improving simulation accuracy and developing event-camera-specific domain adaptation methods are critically needed.
📝 Abstract
Event cameras are gaining traction in traffic monitoring applications due to their low latency, high temporal resolution, and energy efficiency, which make them well-suited for real-time object detection at traffic intersections. However, the development of robust event-based detection models is hindered by the limited availability of annotated real-world datasets. To address this, several simulation tools have been developed to generate synthetic event data. Among these, the CARLA driving simulator includes a built-in dynamic vision sensor (DVS) module that emulates event camera output. Despite its potential, the sim-to-real gap for event-based object detection remains insufficiently studied. In this work, we present a systematic evaluation of this gap by training a recurrent vision transformer model exclusively on synthetic data generated using CARLA's DVS and testing it on varying combinations of synthetic and real-world event streams. Our experiments show that models trained solely on synthetic data perform well on synthetic-heavy test sets but suffer significant performance degradation as the proportion of real-world data increases. In contrast, models trained on real-world data demonstrate stronger generalization across domains. This study offers the first quantifiable analysis of the sim-to-real gap in event-based object detection using CARLA's DVS. Our findings highlight limitations in current DVS simulation fidelity and underscore the need for improved domain adaptation techniques in neuromorphic vision for traffic monitoring.
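To make concrete what a DVS module such as CARLA's approximates, the sketch below implements the standard idealized event-camera model: each pixel keeps a reference log intensity and emits a signed event whenever the current log intensity deviates from that reference by more than a contrast threshold. This is a minimal illustration of the general formulation, not CARLA's actual implementation; the function name, threshold value, and frame inputs are assumptions for the example.

```python
import numpy as np

def dvs_step(curr_frame, ref_log, threshold=0.2, eps=1e-6):
    """One step of an idealized DVS simulation (illustrative sketch).

    curr_frame : 2-D array of linear pixel intensities for the new frame.
    ref_log    : per-pixel reference log intensity (memory of last event).
    Returns a list of (x, y, polarity) events and the updated reference.
    """
    log_i = np.log(curr_frame.astype(np.float64) + eps)
    diff = log_i - ref_log
    fired = np.abs(diff) >= threshold      # contrast threshold crossed
    ys, xs = np.nonzero(fired)
    pol = np.where(diff[ys, xs] > 0, 1, -1)  # +1 brighter, -1 darker
    ref_log[ys, xs] = log_i[ys, xs]        # reset memory at firing pixels
    return list(zip(xs.tolist(), ys.tolist(), pol.tolist())), ref_log
```

Real sensors add per-pixel threshold mismatch, refractory periods, and noise, which is precisely where simulated and real event streams diverge and the sim-to-real gap studied here originates.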