MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation of scene awareness and social experience in extended reality (XR) caused by overlapping sound sources in complex acoustic environments. To this end, the authors propose MoXaRt, a real-time audio-visual fusion system that uses visual cues, such as human faces and musical instruments, as guiding signals for fine-grained multi-source sound separation. The system employs a cascaded neural network architecture: an initial coarse, audio-only disentanglement stage is followed by a refinement stage that integrates real-time visual detection outputs to improve separation fidelity, supporting up to five concurrent sound sources. In a technical evaluation on a newly curated dataset of 30 one-minute multi-source recordings and a user study with 22 participants, MoXaRt significantly improves speech intelligibility, increasing listening comprehension by 36.2% (p < 0.01), and substantially reduces cognitive load (p < 0.001).

📝 Abstract
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p<0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p<0.001), thereby paving the way for more perceptive and socially adept XR experiences.
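The cascaded design described in the abstract lends itself to a compact sketch. The PyTorch-style code below is an illustrative reading of the pipeline, not the authors' implementation: a coarse audio-only separator produces rough stems, and a refinement network conditioned on a per-source visual embedding (from face or instrument detection) isolates each source. All module names, layer sizes, and the FiLM-style conditioning are assumptions.

```python
# Illustrative sketch (not the authors' code) of audio-visual guided separation:
# coarse audio-only separation, then per-source refinement driven by a visual anchor.
import torch
import torch.nn as nn

class CoarseSeparator(nn.Module):
    """Audio-only stage: splits the mixture spectrogram into K rough stems."""
    def __init__(self, n_fft_bins=513, k_sources=5, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_fft_bins, hidden, num_layers=2, batch_first=True)
        self.mask_head = nn.Linear(hidden, n_fft_bins * k_sources)
        self.k = k_sources

    def forward(self, mix_spec):                       # (B, T, F) magnitude spectrogram
        h, _ = self.rnn(mix_spec)
        masks = torch.sigmoid(self.mask_head(h))       # (B, T, F*K)
        masks = masks.view(*mix_spec.shape[:2], -1, self.k)
        return mix_spec.unsqueeze(-1) * masks          # (B, T, F, K) rough stems

class GuidedRefiner(nn.Module):
    """Refinement stage conditioned on a visual anchor (face/instrument embedding)."""
    def __init__(self, n_fft_bins=513, vis_dim=128, hidden=256):
        super().__init__()
        self.film = nn.Linear(vis_dim, 2 * hidden)     # FiLM-style gain/bias from vision
        self.enc = nn.GRU(n_fft_bins, hidden, batch_first=True)
        self.dec = nn.Linear(hidden, n_fft_bins)

    def forward(self, rough_stem, vis_embed):          # (B, T, F), (B, vis_dim)
        h, _ = self.enc(rough_stem)
        gain, bias = self.film(vis_embed).chunk(2, dim=-1)
        h = h * gain.unsqueeze(1) + bias.unsqueeze(1)  # inject the visual anchor
        mask = torch.sigmoid(self.dec(h))
        return rough_stem * mask                       # refined single-source spectrogram

# Usage (shapes only): separate coarsely, then refine each stem with its visual embedding.
mix = torch.randn(1, 200, 513).abs()
vis = torch.randn(1, 128)                              # e.g., one detected face/instrument
stems = CoarseSeparator()(mix)                         # (1, 200, 513, 5)
refined = GuidedRefiner()(stems[..., 0], vis)          # (1, 200, 513)
```

In this reading, the visual branch only modulates the refinement stage, consistent with the abstract's description of coarse audio separation running in parallel with visual detection before the two are fused.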
Problem

Research questions and friction points this paper is trying to address.

Extended Reality
sound source separation
audio-visual interaction
acoustic environment
scene awareness
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-visual separation
real-time XR system
source-guided refinement
speech intelligibility
cognitive load reduction
Authors
Tianyu Xu, Google
Sieun Kim, University of Michigan
Qianhui Zheng, University of Michigan
Ruoyu Xu, Zhejiang University, ByteDance
Tejasvi Ravi, Google
Anuva Kulkarni, Google
Katrina Passarella-Ward, Google
Junyi Zhu, Assistant Professor, University of Michigan (HCI, Health Sensing, Personal Fabrication, Applied Machine Learning)
Adarsh Kowdle, Engineering Director, Google AR/XR (3D Computer Vision, Machine Learning)