🤖 AI Summary
This work addresses the simultaneous detection and 3D localization of multiple sound events in binaural audio. Inspired by human spatial hearing, we propose the Binaural Time-Frequency Feature (BTFF), an eight-channel input representation that jointly encodes interaural time differences (ITD), interaural level differences (ILD), and high-frequency spectral cues, enabling joint azimuth and elevation estimation. On top of BTFF we design BiSELDnet, a CRNN-based architecture that learns both spectro-temporal patterns and HRTF-based localization cues. Evaluated on the newly curated Binaural Set benchmark, the model achieves a SELD error of 0.110, an F-score of 87.1%, and a mean localization error of 4.4°, outperforming existing binaural SELD methods. Together, these results demonstrate a practical framework for accurate, human-inspired spatial auditory modeling.
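The eight-channel BTFF input described above can be pictured as a simple tensor stack of per-ear sub-features. The sketch below is illustrative only: the channel ordering, the mel-band count, and the use of a temporal delta for the velocity-map are assumptions for demonstration, not the paper's exact specification.

```python
import numpy as np

# Illustrative shapes: F mel bands x T time frames.
F, T = 64, 100
rng = np.random.default_rng(0)

# Placeholder sub-features standing in for the BTFF channels
# (random data here; a real pipeline would compute these from audio).
mel_left = rng.random((F, T))    # left mel-spectrogram
mel_right = rng.random((F, T))   # right mel-spectrogram
# Velocity-maps approximated as first-order temporal deltas (assumption).
v_left = np.diff(mel_left, axis=1, prepend=mel_left[:, :1])
v_right = np.diff(mel_right, axis=1, prepend=mel_right[:, :1])
sc_left = rng.random((F, T))     # high-frequency spectral-cue map (placeholder)
sc_right = rng.random((F, T))
itd_map = rng.random((F, T))     # interaural time-difference map (placeholder)
ild_map = mel_left - mel_right   # crude per-bin level difference as ILD proxy

# Stack into the (channels, freq, time) tensor fed to a CRNN like BiSELDnet.
btff = np.stack([mel_left, mel_right, v_left, v_right,
                 sc_left, sc_right, itd_map, ild_map])
print(btff.shape)  # (8, 64, 100)
```

The key point is only the shape: eight aligned time-frequency planes that a 2D-convolutional front end can consume as channels.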
📝 Abstract
This paper introduces Binaural Sound Event Localization and Detection (BiSELD), a task that aims to jointly detect and localize multiple sound events using binaural audio, inspired by the spatial hearing mechanism of humans. To support this task, we present a synthetic benchmark dataset, called the Binaural Set, which simulates realistic auditory scenes using measured head-related transfer functions (HRTFs) and diverse sound events. To effectively address the BiSELD task, we propose a new input feature representation called the Binaural Time-Frequency Feature (BTFF), which encodes interaural time difference (ITD), interaural level difference (ILD), and high-frequency spectral cues (SC) from binaural signals. BTFF is composed of eight channels, including left and right mel-spectrograms, velocity-maps (V-maps), SC-maps, and ITD-/ILD-maps, designed to cover different spatial cues across frequency bands and spatial axes. A CRNN-based model, BiSELDnet, is then developed to learn both spectro-temporal patterns and HRTF-based localization cues from BTFF. Experiments on the Binaural Set show that each BTFF sub-feature enhances task performance: the V-map improves detection, the ITD-/ILD-maps enable accurate horizontal localization, and the SC-map captures vertical spatial cues. The final system achieves a SELD error of 0.110 with an 87.1% F-score and a 4.4° localization error, demonstrating the effectiveness of the proposed framework in mimicking human-like auditory perception.
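To make the ITD-/ILD-map idea concrete, here is a minimal sketch of how such maps can be derived from a binaural pair. This is not the paper's exact recipe: as an assumption for illustration, ILD is computed as the per-bin magnitude ratio in dB and ITD from the interaural phase difference, on a plain STFT rather than a mel representation.

```python
import numpy as np

def btff_maps(left, right, sr=16000, n_fft=512, hop=256, eps=1e-8):
    """Illustrative ILD- and ITD-maps from a binaural signal pair."""
    def stft(x):
        win = np.hanning(n_fft)
        frames = [win * x[i:i + n_fft]
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.fft.rfft(np.asarray(frames), axis=1).T  # (freq, time)

    L, R = stft(left), stft(right)
    # ILD-map: per-bin level ratio in dB (positive => left ear louder).
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    # ITD-map: wrapped interaural phase difference converted to seconds.
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    ipd = np.angle(L * np.conj(R))
    with np.errstate(divide="ignore", invalid="ignore"):
        itd = ipd / (2.0 * np.pi * freqs[:, None])
    itd[0, :] = 0.0  # ITD is undefined at DC
    return ild, itd

# Toy usage: a 500 Hz tone delayed and attenuated at the right ear.
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 500 * t)      # exactly periodic over 1 s
left = sig
right = 0.5 * np.roll(sig, 8)          # 8-sample delay = 0.5 ms
ild, itd = btff_maps(left, right, sr=sr)
# At the 500 Hz bin (index 16): ILD ~ +6 dB, ITD ~ +0.5 ms.
print(ild[16].mean(), itd[16].mean())
```

Note the phase-based ITD wraps above roughly 1-2 kHz for head-sized delays, which is one reason binaural models pair it with ILD and high-frequency spectral cues.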