🤖 AI Summary
This work investigates whether state-of-the-art music source separation (MSS) models preserve spatial cues in binaural audio—a critical requirement for immersive applications (e.g., VR/AR) and for accessibility. We construct standardized stereo and synthetic binaural datasets derived from MUSDB18-HQ using publicly available head-related transfer functions (HRTFs), enabling the first systematic evaluation of how well leading MSS models retain binaural cues—including interaural time differences (ITD), interaural level differences (ILD), and spectral asymmetry. Results demonstrate that current models severely degrade spatial perception and immersion, and that the severity of the degradation varies with model architecture and target instrument class. The study exposes a fundamental limitation of conventional MSS for spatial audio processing, establishes a reproducible benchmark for evaluating spatial fidelity, and motivates new modeling paradigms tailored to immersive audio scenarios.
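The interaural cues named above can be computed directly from a two-channel signal. The following is a minimal sketch, not the paper's actual evaluation code: it estimates ILD as the energy ratio between channels in dB, and ITD as the lag of the cross-correlation peak (in samples). The function names and sign convention are my own assumptions.

```python
import numpy as np

def ild_db(left, right, eps=1e-12):
    # Interaural level difference: ratio of channel energies in dB.
    # Positive values mean the left channel is louder.
    e_left = np.sum(left ** 2) + eps
    e_right = np.sum(right ** 2) + eps
    return 10.0 * np.log10(e_left / e_right)

def itd_samples(left, right):
    # Interaural time difference via the cross-correlation peak lag.
    # Negative values mean the left channel leads (sound reaches the
    # left ear first); convert to seconds by dividing by the sample rate.
    corr = np.correlate(left, right, mode="full")
    return int(np.argmax(corr)) - (len(right) - 1)
```

In practice such cues are usually computed per frequency band and per time frame (e.g., on a gammatone or STFT decomposition) rather than over the whole signal, but the broadband version above conveys the idea.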
📝 Abstract
Binaural audio remains underexplored within the music information retrieval community. Motivated by the rising popularity of virtual and augmented reality experiences as well as potential applications to accessibility, we investigate how well existing music source separation (MSS) models perform on binaural audio. Although these models process two-channel inputs, it is unclear how effectively they retain spatial information. In this work, we evaluate how several popular MSS models preserve spatial information on both standard stereo and novel binaural datasets. Our binaural data is synthesized using stems from MUSDB18-HQ and open-source head-related transfer functions by positioning instrument sources randomly along the horizontal plane. We then assess the spatial quality of the separated stems using signal processing and interaural cue-based metrics. Our results show that stereo MSS models fail to preserve the spatial information critical for maintaining the immersive quality of binaural audio, and that the degradation depends on model architecture as well as the target instrument. Finally, we highlight valuable opportunities for future work at the intersection of MSS and immersive audio.
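The synthesis procedure described in the abstract—rendering binaural mixtures by filtering each stem with a head-related impulse response (HRIR) pair for its assigned azimuth—can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline; the function name and the assumption that each source already has a chosen HRIR pair are mine.

```python
import numpy as np

def render_binaural(stems, hrirs):
    """Mix mono stems into a two-channel binaural signal.

    stems : list of 1-D mono arrays, one per instrument source.
    hrirs : list of (hrir_left, hrir_right) pairs, one per stem,
            selected for that source's azimuth on the horizontal plane.
    """
    # Output length accommodates the longest stem/HRIR convolution.
    n = max(len(s) + max(len(hl), len(hr)) - 1
            for s, (hl, hr) in zip(stems, hrirs))
    out = np.zeros((2, n))
    for s, (hl, hr) in zip(stems, hrirs):
        left = np.convolve(s, hl)   # filter stem with left-ear HRIR
        right = np.convolve(s, hr)  # filter stem with right-ear HRIR
        out[0, :len(left)] += left
        out[1, :len(right)] += right
    return out
```

Summing the spatialized stems yields the binaural mixture fed to the separators, while the individual convolved stems serve as spatial ground truth for the separated outputs.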