SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech separation and enhancement models exhibit limited generalization in moving-source scenarios, primarily due to insufficient diversity and realism in evaluation data: both real-world and synthetic datasets fail to adequately reflect practical acoustic conditions. To address this, we propose SonicSim, the first customizable acoustic simulation framework specifically designed for moving sound sources. SonicSim leverages Habitat-sim for physically accurate, multi-source spatial modeling and integrates the LibriSpeech, FSD50K, and FMA audio corpora with Matterport3D indoor environments. Based on this framework, we construct SonicSet, a large-scale, high-fidelity benchmark dataset, and complement it with real-world counterpart recordings. Experiments demonstrate that models trained on SonicSet generalize significantly better to real moving-source recordings than those trained on existing synthetic datasets, effectively narrowing the synthetic-to-real acoustic domain gap.

📝 Abstract
Systematic evaluation of speech separation and enhancement models under moving sound source conditions requires extensive and diverse data. However, real-world datasets often lack sufficient data for training and evaluation, and synthetic datasets, while larger, lack acoustic realism. Consequently, neither effectively meets practical needs. To address this issue, we introduce SonicSim, a synthetic toolkit based on the embodied AI simulation platform Habitat-sim, designed to generate highly customizable data for moving sound sources. SonicSim supports multi-level adjustments, including scene-level, microphone-level, and source-level adjustments, enabling the creation of more diverse synthetic data. Leveraging SonicSim, we constructed a benchmark dataset called SonicSet, utilizing LibriSpeech, Freesound Dataset 50k (FSD50K), Free Music Archive (FMA), and 90 scenes from Matterport3D to evaluate speech separation and enhancement models. Additionally, to investigate the differences between synthetic and real-world data, we selected 5 hours of raw, non-reverberant data from the SonicSet validation set and recorded a real-world speech separation dataset, providing a reference for comparing SonicSet with other synthetic datasets. For speech enhancement, we utilized the real-world dataset RealMAN to validate the acoustic gap between SonicSet and existing synthetic datasets. The results indicate that models trained on SonicSet generalize better to real-world scenarios compared to other synthetic datasets. The code is publicly available at https://cslikai.cn/SonicSim/.
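The abstract describes three levels of customization (scene-level, microphone-level, and source-level) used to generate diverse data for moving sound sources. As a minimal, hypothetical sketch of how such a layered configuration might be organized, the snippet below groups the three levels into nested dataclasses. All class and field names here are illustrative assumptions, not SonicSim's actual API; consult the project page for the real interface.

```python
from dataclasses import dataclass, field

@dataclass
class SceneConfig:
    # Scene-level: which Matterport3D environment to simulate in
    # (hypothetical field; "17DRP5sb8fy" is a public Matterport3D scan ID)
    scene_id: str = "17DRP5sb8fy"

@dataclass
class MicrophoneConfig:
    # Microphone-level: array geometry as (x, y, z) offsets in metres
    positions: list = field(default_factory=lambda: [(0.0, 0.0, 0.0)])

@dataclass
class SourceConfig:
    # Source-level: waypoints the moving source traverses, and its speed
    trajectory: list = field(default_factory=list)
    speed_mps: float = 1.0

@dataclass
class SimulationConfig:
    scene: SceneConfig = field(default_factory=SceneConfig)
    microphone: MicrophoneConfig = field(default_factory=MicrophoneConfig)
    sources: list = field(default_factory=list)

# Example: one speech source moving 2 m along the x-axis while rising 1 m
cfg = SimulationConfig(
    sources=[SourceConfig(trajectory=[(0, 0, 0), (2, 0, 1)])]
)
```

Separating the levels this way mirrors the abstract's claim that each can be adjusted independently, so a single scene can be reused with many microphone layouts and source trajectories.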
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse data for speech processing in moving sound source scenarios.
Synthetic datasets lack acoustic realism for practical applications.
Need for customizable simulation tools to generate realistic synthetic data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SonicSim: customizable simulation for moving sound sources
Generates diverse synthetic data with multi-level adjustments
SonicSet benchmark dataset improves real-world model generalization
Kai Li
Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China
Wendi Sang
Qinghai University
Audio, Speech Separation, Multimodal Fusion, Machine Learning
Chang Zeng
National Institute of Informatics
speech processing, speech/singing synthesis, audio/music generation, speaker recognition
Runxuan Yang
Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China
Guo Chen
Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China
Xiaolin Hu
Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China; Tsinghua Laboratory of Brain and Intelligence (THBI), IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China; Chinese Institute for Brain Research (CIBR), Beijing 100010, China