🤖 AI Summary
To address key challenges in idling vehicle detection, namely poor audio-visual alignment and low robustness to small targets and occlusions, this paper proposes an end-to-end cross-modal Transformer framework. It introduces a global patch-level audio-visual alignment mechanism that deeply fuses far-field audio with multi-scale surveillance video; integrates a multi-scale CNN feature pyramid with a spectrogram encoding module to jointly model spatiotemporal and spectral cues; and employs a decoupled detection head that separately optimizes bounding-box regression and frame-level motion-state classification (driving/idling/off). On the AVIVD dataset, the method achieves 72.38% mAP, surpassing the disjoint and end-to-end baselines by 7.66 and 9.42 points, respectively, with consistent AP gains across all motion states, and exceeds state-of-the-art sound source localization methods.
📝 Abstract
Idling vehicle detection (IVD) supports real-time systems that reduce pollution and emissions by dynamically messaging drivers to curb excess idling behavior. In computer vision, IVD has become an emerging task that leverages video from surveillance cameras and audio from remote microphones to localize and classify vehicles in each frame as moving, idling, or engine-off. As with other cross-modal tasks, the key challenge lies in modeling the correspondence between audio and visual modalities, which differ in representation but provide complementary cues -- video offers spatial and motion context, while audio conveys engine activity beyond the visual field. The previous end-to-end model, which uses a basic attention mechanism, struggles to align these modalities effectively, often missing vehicle detections. To address this issue, we propose AVIVDNetv2, a transformer-based end-to-end detection network. It incorporates a cross-modal transformer with global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads. Extensive experiments show that AVIVDNetv2 improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline, with consistent AP gains across all vehicle categories. Furthermore, AVIVDNetv2 outperforms the state-of-the-art method for sounding object localization, establishing a new performance benchmark on the AVIVD dataset.
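The global patch-level fusion and decoupled detection heads described above can be sketched in a simplified form. The snippet below is a minimal NumPy illustration, not the paper's implementation: a single attention head with random matrices standing in for learned projections, visual patch tokens attending over audio spectrogram tokens, and two separate linear heads for box regression and three-way motion-state classification. All names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio):
    """Each visual patch token attends to all audio tokens (single head,
    identity projections for simplicity), then fuses via a residual add."""
    d_k = visual.shape[-1]
    scores = visual @ audio.T / np.sqrt(d_k)   # (P, T) patch-to-audio affinities
    attn = softmax(scores, axis=-1)            # rows sum to 1
    return visual + attn @ audio               # fused patch features, (P, d)

rng = np.random.default_rng(0)
P, T, d = 16, 8, 32                            # visual patches, audio frames, feature dim
visual = rng.normal(size=(P, d))               # multi-scale visual features, flattened to patches
audio = rng.normal(size=(T, d))                # encoded spectrogram tokens

fused = cross_modal_attention(visual, audio)

# Decoupled heads: bounding-box regression and motion-state classification
# (driving / idling / off) are optimized by separate projections.
W_box = rng.normal(size=(d, 4))
W_cls = rng.normal(size=(d, 3))
boxes = fused @ W_box                          # (P, 4) per-patch box offsets
states = softmax(fused @ W_cls, axis=-1)       # (P, 3) state probabilities
```

In the actual model the projections are learned, attention is multi-head, and the visual tokens come from a CNN feature pyramid; the sketch only shows why decoupling helps: each head gets its own parameters over the shared fused features, so localization and state classification do not compete for one output space.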