MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

📅 2025-10-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing spatial audio research is hindered by reliance on monaural datasets, limiting immersive multimodal modeling. To address this, we introduce MRSAudio, a large-scale multimodal spatial audio dataset featuring synchronized binaural and ambisonic audio, egocentric and exocentric video, and motion trajectories, captured across four realistic domains: daily life, speech, music, and singing. The dataset includes fine-grained annotations such as speech transcriptions, phoneme boundaries, lyrics, musical scores, and prompts, enabling rigorous cross-modal alignment. We validate MRSAudio on key tasks including audio spatialization; spatial speech, singing, and music synthesis; and sound event localization and detection. Experimental results demonstrate substantial improvements in spatial perception fidelity and generative quality. By bridging critical gaps in data scale, modality diversity, and annotation richness, MRSAudio establishes a foundational resource for advancing multimodal spatial audio understanding and generation.

📝 Abstract
Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.
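The binaural cues the abstract alludes to can be made concrete with a small, self-contained sketch (not drawn from the paper): estimating the interaural time difference (ITD) between the two channels of a synthetic binaural signal by cross-correlation. All names and parameters below are illustrative.

```python
import numpy as np

sr = 16000                        # sample rate in Hz (illustrative choice)
true_delay = 20                   # inter-channel lag in samples (~1.25 ms)
rng = np.random.default_rng(0)
src = rng.standard_normal(sr)     # 1 s of noise standing in for a sound source

left = src
right = np.roll(src, true_delay)  # the right ear hears the source later

# Search candidate lags for the one maximizing cross-channel correlation.
lags = np.arange(-50, 51)
corr = [np.dot(left, np.roll(right, -k)) for k in lags]
itd_samples = int(lags[np.argmax(corr)])
itd_ms = 1000 * itd_samples / sr
print(itd_samples, itd_ms)        # → 20 1.25
```

A spatial audio model trained on binaural recordings must implicitly reproduce cues like this ITD (and the corresponding level difference) for a listener to perceive direction.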
Problem

Research questions and friction points this paper is trying to address.

Addressing limited spatial audio data for immersive technologies
Providing multimodal synchronized recordings with refined annotations
Enabling spatial audio generation and understanding across diverse scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal spatial audio dataset
Synchronized binaural and ambisonic audio recordings
Fine-grained annotations for spatial audio tasks
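To illustrate what the ambisonic recordings mentioned above represent, here is a minimal sketch (not from the paper, and using a simplified encoding convention as an assumption) that decodes a synthetic first-order B-format signal to a stereo pair via virtual cardioid microphones.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
az = np.deg2rad(60)                      # source azimuth: 60° to the left
src = np.sin(2 * np.pi * 440 * t)        # synthetic source signal

# First-order B-format encoding (simplified convention, assumed here):
W = src / np.sqrt(2)                     # omnidirectional pressure
X = src * np.cos(az)                     # front-back velocity component
Y = src * np.sin(az)                     # left-right velocity component

def virtual_cardioid(W, X, Y, azimuth):
    # Cardioid response: half pressure plus half velocity toward `azimuth`.
    return 0.5 * (np.sqrt(2) * W + X * np.cos(azimuth) + Y * np.sin(azimuth))

left = virtual_cardioid(W, X, Y, np.deg2rad(90))    # virtual mic aimed left
right = virtual_cardioid(W, X, Y, np.deg2rad(-90))  # virtual mic aimed right

# A source at +60° (left side) should come out louder in the left channel.
print(np.abs(left).max() > np.abs(right).max())     # → True
```

Unlike binaural audio, which bakes in a specific listener, the B-format representation keeps the full sound field, so it can be re-decoded for any head orientation; recording both, as this dataset does, supports either workflow.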
👥 Authors
Wenxiang Guo
Zhejiang University
Changhao Pan
Zhejiang University
Multi-Modal Generative AI, Singing Voice Synthesis
Zhiyuan Zhu
Shanghai Jiao Tong University
NLP, ASR, TTS
Xintong Hu
Zhejiang University
Yu Zhang
Zhejiang University
Li Tang
Zhejiang University
Rui Yang
Zhejiang University
Han Wang
Zhejiang University
Zongbao Zhang
Zhejiang University
Yuhan Wang
Zhejiang University
Yixuan Chen
Oxford Suzhou Center for Advanced Research
Disentanglement, Vision-Language Model, AI for Medical
Hankun Xu
Zhejiang University
Ke Xu
Zhejiang University
Pengfei Fan
Zhejiang University
Zhetao Chen
Zhejiang University
Yanhao Yu
Zhejiang University
Qiange Huang
Zhejiang University
Fei Wu
Zhejiang University
Zhou Zhao
Zhejiang University
Machine Learning, Data Mining, Multimedia Computing