AI Summary
This work proposes a binaural target speaker extraction framework that uses the listener's head-related transfer function (HRTF) as a spatial prior, addressing the limitations of conventional methods that rely on direction-of-arrival estimation or enrollment signals and often introduce spatial auditory distortions. For the first time, individualized HRTFs are explicitly incorporated into a multichannel deep blind source separation model, enabling HRTF-guided conditional extraction that generalizes across listeners without user-specific calibration. The approach preserves binaural spatial localization cues while substantially improving speech quality and intelligibility. Experiments with real-measured HRTF data demonstrate superior performance in both simulated and real-recorded scenarios.
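The summary does not specify how the HRTF embedding conditions the separation model; one common choice for injecting a conditioning vector into a network's intermediate features is feature-wise linear modulation (FiLM). The sketch below illustrates that mechanism under stated assumptions: `film_condition`, the embedding, and the projection matrices are all hypothetical names, not the paper's actual architecture.

```python
import numpy as np

def film_condition(features, hrtf_embedding, scale_w, shift_w):
    """FiLM-style conditioning: scale and shift mixture features with
    vectors projected from an HRTF-derived embedding (illustrative only)."""
    gamma = hrtf_embedding @ scale_w   # per-feature scale, shape (F,)
    beta = hrtf_embedding @ shift_w    # per-feature shift, shape (F,)
    return features * gamma + beta

# Toy dimensions: T time frames, F feature channels, E embedding size.
rng = np.random.default_rng(1)
T, F, E = 50, 16, 8
mix_feats = rng.standard_normal((T, F))      # stand-in for encoder output
hrtf_emb = rng.standard_normal(E)            # stand-in for an HRTF embedding
scale_w = rng.standard_normal((E, F))
shift_w = rng.standard_normal((E, F))

conditioned = film_condition(mix_feats, hrtf_emb, scale_w, shift_w)
```

In a full model, the scale/shift projections would be learned jointly with the separation backbone so the network can steer extraction toward the target direction encoded by the HRTF.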
Abstract
This paper presents a Head-Related Transfer Function (HRTF)-guided framework for binaural Target Speaker Extraction (TSE) from mixtures of concurrent sources. Unlike conventional TSE methods based on Direction of Arrival (DOA) estimation or enrollment signals, which often distort the perceived spatial location, the proposed approach leverages the listener's HRTF as an explicit spatial prior. The framework is built on a multi-channel deep blind source separation backbone adapted to the binaural TSE setting, and is trained on HRTFs measured from a diverse population, enabling cross-listener generalization rather than subject-specific tuning. By conditioning the extraction on HRTF-derived spatial information, the method preserves binaural cues while enhancing speech quality and intelligibility. Performance is validated through simulations and real recordings obtained from a head and torso simulator (HATS).
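To make "binaural cues" concrete: an HRTF pair (one impulse response per ear) encodes the interaural level and time differences (ILD/ITD) that let a listener localize a source. A minimal sketch, using toy hand-made HRIRs rather than any measured data, of how convolving a mono source with a left/right HRIR pair produces a binaural signal carrying those cues:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono source with a left/right head-related impulse
    response (HRIR) pair; the output carries the spatial cues (ILD/ITD)
    that a TSE system should preserve."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

# Toy HRIR pair for a source on the listener's left: the left ear receives
# an earlier, stronger tap than the right ear (illustrative values only).
hrir_l = np.array([0.9, 0.1, 0.0, 0.0])
hrir_r = np.array([0.0, 0.0, 0.4, 0.1])

rng = np.random.default_rng(0)
mono = rng.standard_normal(1000)
binaural = render_binaural(mono, hrir_l, hrir_r)

# ILD cue: the left channel carries more energy for a source on the left.
ild_db = 10 * np.log10(np.sum(binaural[0]**2) / np.sum(binaural[1]**2))
```

In this picture, the paper's evaluation criterion amounts to checking that the extracted target, when compared with the HRTF-rendered reference, retains these level and timing differences rather than collapsing the source toward the center of the head.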