🤖 AI Summary
To address key challenges in idling vehicle detection, namely poor audio-visual alignment and low robustness to small targets and occlusions, this paper proposes an end-to-end cross-modal Transformer framework. It introduces a global patch-level audio-visual alignment mechanism that deeply fuses far-field audio with multi-scale surveillance video; integrates a multi-scale CNN feature pyramid with a spectrogram encoding module to jointly model spatiotemporal and spectral cues; and employs a decoupled detection head that separately optimizes bounding-box regression and frame-level motion-state classification (driving/idling/off). On the AVIVD dataset, the method achieves 72.38% mAP, surpassing the disjoint and end-to-end baselines by 7.66 and 9.42 points, respectively, with consistent AP gains across all motion states, and exceeds state-of-the-art sound source localization methods.
📝 Abstract
Idling vehicle detection (IVD) supports real-time systems that reduce pollution and emissions by dynamically messaging drivers to curb excess idling behavior. In computer vision, IVD has become an emerging task that leverages video from surveillance cameras and audio from remote microphones to localize and classify vehicles in each frame as moving, idling, or engine-off. As with other cross-modal tasks, the key challenge lies in modeling the correspondence between audio and visual modalities, which differ in representation but provide complementary cues -- video offers spatial and motion context, while audio conveys engine activity beyond the visual field. The previous end-to-end model, which uses a basic attention mechanism, struggles to align these modalities effectively, often missing vehicle detections. To address this issue, we propose AVIVDNetv2, a transformer-based end-to-end detection network. It incorporates a cross-modal transformer with global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads. Extensive experiments show that AVIVDNetv2 improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline, with consistent AP gains across all vehicle categories. Furthermore, AVIVDNetv2 outperforms the state-of-the-art method for sounding object localization, establishing a new performance benchmark on the AVIVD dataset.
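The global patch-level fusion and decoupled detection heads described above can be sketched in a simplified form. The snippet below is a minimal NumPy illustration, not the paper's implementation: a single attention head with random matrices standing in for learned projections, visual patch tokens attending over audio spectrogram tokens, and two separate linear heads for box regression and three-way motion-state classification. All names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio):
    """Each visual patch token attends to all audio tokens (single head,
    identity projections for simplicity), then fuses via a residual add."""
    d_k = visual.shape[-1]
    scores = visual @ audio.T / np.sqrt(d_k)   # (P, T) patch-to-audio affinities
    attn = softmax(scores, axis=-1)            # rows sum to 1
    return visual + attn @ audio               # fused patch features, (P, d)

rng = np.random.default_rng(0)
P, T, d = 16, 8, 32                            # visual patches, audio frames, feature dim
visual = rng.normal(size=(P, d))               # multi-scale visual features, flattened to patches
audio = rng.normal(size=(T, d))                # encoded spectrogram tokens

fused = cross_modal_attention(visual, audio)

# Decoupled heads: bounding-box regression and motion-state classification
# (driving / idling / off) are optimized by separate projections.
W_box = rng.normal(size=(d, 4))
W_cls = rng.normal(size=(d, 3))
boxes = fused @ W_box                          # (P, 4) per-patch box offsets
states = softmax(fused @ W_cls, axis=-1)       # (P, 3) state probabilities
```

In the actual model the projections are learned, attention is multi-head, and the visual tokens come from a CNN feature pyramid; the sketch only shows why decoupling helps: each head gets its own parameters over the shared fused features, so localization and state classification do not compete for one output space.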