🤖 AI Summary
This work addresses the simultaneous detection and 3D localization of multiple sound events in binaural audio. Inspired by human spatial hearing, we propose the Binaural Time-Frequency Feature (BTFF), an eight-channel input representation that jointly encodes interaural time differences (ITD), interaural level differences (ILD), and high-frequency spectral cues, enabling joint azimuth and elevation estimation. On top of BTFF we design BiSELDnet, a CRNN-based architecture that learns both spectro-temporal patterns and HRTF-based localization cues. Evaluated on the newly curated Binaural Set benchmark, the model achieves a SELD error of 0.110, an F-score of 87.1%, and a mean localization error of 4.4°, outperforming existing binaural SELD methods. Together, these results demonstrate a practical framework for accurate, human-inspired spatial auditory modeling.
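The eight-channel BTFF input described above can be pictured as a simple tensor stack of per-ear sub-features. The sketch below is illustrative only: the channel ordering, the mel-band count, and the use of a temporal delta for the velocity-map are assumptions for demonstration, not the paper's exact specification.

```python
import numpy as np

# Illustrative shapes: F mel bands x T time frames.
F, T = 64, 100
rng = np.random.default_rng(0)

# Placeholder sub-features standing in for the BTFF channels
# (random data here; a real pipeline would compute these from audio).
mel_left = rng.random((F, T))    # left mel-spectrogram
mel_right = rng.random((F, T))   # right mel-spectrogram
# Velocity-maps approximated as first-order temporal deltas (assumption).
v_left = np.diff(mel_left, axis=1, prepend=mel_left[:, :1])
v_right = np.diff(mel_right, axis=1, prepend=mel_right[:, :1])
sc_left = rng.random((F, T))     # high-frequency spectral-cue map (placeholder)
sc_right = rng.random((F, T))
itd_map = rng.random((F, T))     # interaural time-difference map (placeholder)
ild_map = mel_left - mel_right   # crude per-bin level difference as ILD proxy

# Stack into the (channels, freq, time) tensor fed to a CRNN like BiSELDnet.
btff = np.stack([mel_left, mel_right, v_left, v_right,
                 sc_left, sc_right, itd_map, ild_map])
print(btff.shape)  # (8, 64, 100)
```

The key point is only the shape: eight aligned time-frequency planes that a 2D-convolutional front end can consume as channels.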
📝 Abstract
This paper introduces Binaural Sound Event Localization and Detection (BiSELD), a task that aims to jointly detect and localize multiple sound events using binaural audio, inspired by the spatial hearing mechanism of humans. To support this task, we present a synthetic benchmark dataset, called the Binaural Set, which simulates realistic auditory scenes using measured head-related transfer functions (HRTFs) and diverse sound events. To effectively address the BiSELD task, we propose a new input feature representation called the Binaural Time-Frequency Feature (BTFF), which encodes interaural time difference (ITD), interaural level difference (ILD), and high-frequency spectral cues (SC) from binaural signals. BTFF is composed of eight channels, including left and right mel-spectrograms, velocity-maps (V-maps), SC-maps, and ITD-/ILD-maps, designed to cover different spatial cues across frequency bands and spatial axes. A CRNN-based model, BiSELDnet, is then developed to learn both spectro-temporal patterns and HRTF-based localization cues from BTFF. Experiments on the Binaural Set show that each BTFF sub-feature enhances task performance: the V-map improves detection, the ITD-/ILD-maps enable accurate horizontal localization, and the SC-map captures vertical spatial cues. The final system achieves a SELD error of 0.110 with an 87.1% F-score and a 4.4° localization error, demonstrating the effectiveness of the proposed framework in mimicking human-like auditory perception.
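To make the ITD-/ILD-map idea concrete, here is a minimal sketch of how such maps can be derived from a binaural pair. This is not the paper's exact recipe: as an assumption for illustration, ILD is computed as the per-bin magnitude ratio in dB and ITD from the interaural phase difference, on a plain STFT rather than a mel representation.

```python
import numpy as np

def btff_maps(left, right, sr=16000, n_fft=512, hop=256, eps=1e-8):
    """Illustrative ILD- and ITD-maps from a binaural signal pair."""
    def stft(x):
        win = np.hanning(n_fft)
        frames = [win * x[i:i + n_fft]
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.fft.rfft(np.asarray(frames), axis=1).T  # (freq, time)

    L, R = stft(left), stft(right)
    # ILD-map: per-bin level ratio in dB (positive => left ear louder).
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    # ITD-map: wrapped interaural phase difference converted to seconds.
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    ipd = np.angle(L * np.conj(R))
    with np.errstate(divide="ignore", invalid="ignore"):
        itd = ipd / (2.0 * np.pi * freqs[:, None])
    itd[0, :] = 0.0  # ITD is undefined at DC
    return ild, itd

# Toy usage: a 500 Hz tone delayed and attenuated at the right ear.
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 500 * t)      # exactly periodic over 1 s
left = sig
right = 0.5 * np.roll(sig, 8)          # 8-sample delay = 0.5 ms
ild, itd = btff_maps(left, right, sr=sr)
# At the 500 Hz bin (index 16): ILD ~ +6 dB, ITD ~ +0.5 ms.
print(ild[16].mean(), itd[16].mean())
```

Note the phase-based ITD wraps above roughly 1-2 kHz for head-sized delays, which is one reason binaural models pair it with ILD and high-frequency spectral cues.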