🤖 AI Summary
This work addresses the challenging problem of audio-based pedestrian detection in high-noise traffic environments, particularly from roadside acoustic perspectives. We propose the first dedicated analytical framework for roadside acoustic perception, built upon a large-scale, synchronized audio-visual dataset comprising 1,321 hours of real-world road recordings, meticulously annotated with frame-level pedestrian labels and video thumbnails, and characterized by intense vehicular noise. Methodologically, we fuse 16 kHz audio with 1 fps visual cues to enable multimodal alignment, and conduct comprehensive evaluations including cross-dataset benchmarking, noise-impact modeling, and cross-domain robustness testing. Experimental results demonstrate the critical role of acoustic context in detection performance and quantitatively reveal substantial degradation of existing models under complex noise conditions. Our contributions include: (1) the first public benchmark dataset for auditory pedestrian perception in traffic; (2) a reproducible, multimodal analytical framework; and (3) key empirical findings that advance understanding of audio-visual sensing in noisy urban environments.
📝 Abstract
Audio-based pedestrian detection is a challenging task and has, thus far, only been explored in noise-limited environments. We present a new dataset, results, and a detailed analysis of the state of the art in audio-based pedestrian detection in the presence of vehicular noise. In our study, we conduct three analyses: (i) cross-dataset evaluation between noisy and noise-limited environments, (ii) an assessment of the impact of noisy data on model performance, highlighting the influence of acoustic context, and (iii) an evaluation of the models' predictive robustness on out-of-domain sounds. The new dataset comprises 1,321 hours of roadside recordings with traffic-rich soundscapes; each recording includes 16 kHz audio synchronized with frame-level pedestrian annotations and 1 fps video thumbnails.
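The combination of 16 kHz audio with 1 fps frame-level labels implies a straightforward alignment: one 1-second audio window per labelled frame. The sketch below illustrates this alignment under stated assumptions; the function name, array layout, and binary presence labels are hypothetical and are not part of any released dataset API.

```python
import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz audio, as described in the abstract
FRAME_RATE = 1         # 1 fps thumbnails and frame-level labels

def align_audio_to_frames(audio: np.ndarray, frame_labels: np.ndarray):
    """Slice a mono waveform into 1-second windows, one per labelled frame.

    audio        : float array of shape (num_samples,) sampled at 16 kHz
    frame_labels : int array of shape (num_frames,), 1 = pedestrian present
    Returns (windows, labels), where windows has shape (num_frames, 16000).
    """
    samples_per_frame = SAMPLE_RATE // FRAME_RATE
    # Keep only as many whole windows as we have both audio and labels for.
    num_frames = min(len(frame_labels), len(audio) // samples_per_frame)
    windows = audio[: num_frames * samples_per_frame].reshape(num_frames, samples_per_frame)
    return windows, frame_labels[:num_frames]

# Usage with synthetic data: 10 s of noise and 10 frame-level labels.
audio = np.random.randn(10 * SAMPLE_RATE).astype(np.float32)
labels = np.random.randint(0, 2, size=10)
windows, labels = align_audio_to_frames(audio, labels)
print(windows.shape, labels.shape)  # (10, 16000) (10,)
```

The same windowing can feed any audio classifier; the only dataset-specific detail used here is the 16 kHz / 1 fps synchronization stated above.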