🤖 AI Summary
This work addresses the degradation of scene awareness and social experience in extended reality (XR) caused by overlapping sound sources in complex acoustic environments. To this end, the authors propose MoXaRt, a real-time audio-visual fusion system that leverages visual cues, such as human faces and musical instruments, as guiding signals for fine-grained multi-source sound separation. The system employs a cascaded neural architecture: an initial coarse, audio-only disentanglement stage is followed by a refinement stage that integrates real-time visual detection outputs to improve separation fidelity, supporting up to five concurrent sound sources. Evaluated on a newly curated dataset of 30 multi-source recordings and in a user study with 22 participants, MoXaRt significantly improves speech intelligibility, yielding a 36.2% gain in listening comprehension (p < 0.01), while substantially reducing cognitive load (p < 0.001).
📝 Abstract
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2-second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and through a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% increase in listening comprehension (p < 0.01) in adversarial acoustic environments, while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
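The cascaded design described in the abstract, a coarse audio-only pass followed by visually guided refinement, can be illustrated with a toy sketch. This is not the authors' implementation: all function names are hypothetical, the "coarse" stage is a pass-through stub, and the visual anchor is reduced to a single centre frequency per detected source (standing in for a face or instrument embedding).

```python
import numpy as np

SR = 16_000  # sample rate (Hz)
t = np.arange(SR) / SR

# Two toy "sources": a 440 Hz voice-like tone and a 1 kHz instrument-like tone.
voice = np.sin(2 * np.pi * 440 * t)
instrument = np.sin(2 * np.pi * 1000 * t)
mixture = voice + instrument

def coarse_separate(mix, n_sources):
    """Stage 1 (stub): audio-only coarse disentanglement.
    A real system would run a separation network; here every
    rough stem is simply a copy of the mixture."""
    return [mix.copy() for _ in range(n_sources)]

def refine(rough_stem, anchor_hz, bandwidth=50.0):
    """Stage 2 (stub): a visual anchor selects which source to keep.
    Here the anchor is just a centre frequency that gates the
    stem's spectrum with a narrow-band mask."""
    spec = np.fft.rfft(rough_stem)
    freqs = np.fft.rfftfreq(len(rough_stem), 1 / SR)
    mask = np.abs(freqs - anchor_hz) < bandwidth
    return np.fft.irfft(spec * mask, n=len(rough_stem))

def corr(a, b):
    """Pearson correlation between two signals."""
    return float(np.corrcoef(a, b)[0, 1])

# Cascade: coarse pass, then one refinement per detected visual anchor.
rough = coarse_separate(mixture, n_sources=2)
voice_hat = refine(rough[0], anchor_hz=440)
inst_hat = refine(rough[1], anchor_hz=1000)
```

After refinement, each estimated stem correlates strongly with its true source and weakly with the other, which is the behaviour the visual anchors are meant to enforce; the real system replaces both stubs with learned networks and runs in real time.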