Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing far-field speech datasets struggle to balance acoustic fidelity and scalability: real-world recordings are costly, narrowly scoped, and rarely include paired clean/reverberant utterances, while synthetic data relies on geometric-acoustics approximations that neglect critical wave phenomena such as diffraction, scattering, and interference. To address this, we propose a hybrid simulation paradigm integrating wave-based and geometrical acoustics, leveraging the Treble SDK to generate high-fidelity broadband room impulse responses (RIRs) that accurately model complex indoor sound propagation. Building on LibriSpeech, we construct a large-scale far-field speech dataset featuring over 3,000 RIRs, diverse microphone-array configurations, and Ambisonics-format recordings at 32 kHz. The dataset is publicly released on the Hugging Face Hub, enabling benchmarking and data augmentation for automatic speech recognition, dereverberation, and speech enhancement, and significantly narrowing the physical-realism gap between measured and synthetic far-field speech data.
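The augmentation workflow the summary describes, convolving clean utterances with simulated RIRs to obtain far-field speech, can be sketched in a few lines. Everything below is synthetic stand-in data (no Treble10 download): the noise "clean" signal replaces a LibriSpeech utterance and the decaying-noise RIR replaces a Treble10 RIR; only the 32 kHz sample rate comes from the paper.

```python
import numpy as np

fs = 32_000  # Treble10 signals are simulated at 32 kHz

rng = np.random.default_rng(0)
# Stand-in "clean utterance": 1 s of noise (a LibriSpeech waveform in practice)
clean = rng.standard_normal(fs)

# Stand-in RIR: exponentially decaying noise, a crude room-like response
t = np.arange(fs // 2) / fs
rir = rng.standard_normal(t.size) * np.exp(-t / 0.1)
rir /= np.abs(rir).max()

# Reverberant (far-field) signal = clean convolved with the RIR.
# FFT-based linear convolution: zero-pad both to a common power-of-two length.
n = clean.size + rir.size - 1
nfft = 1 << (n - 1).bit_length()
reverberant = np.fft.irfft(np.fft.rfft(clean, nfft) * np.fft.rfft(rir, nfft), nfft)[:n]

print(reverberant.shape)  # → (47999,), i.e. clean.size + rir.size - 1
```

In a real pipeline the same convolution would be applied per channel of the 6-channel device or Ambisonics RIRs to produce multichannel far-field scenes.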

📝 Abstract
Accurate far-field speech datasets are critical for tasks such as automatic speech recognition (ASR), dereverberation, speech enhancement, and source separation. However, current datasets are limited by the trade-off between acoustic realism and scalability. Measured corpora provide faithful physics but are expensive, low-coverage, and rarely include paired clean and reverberant data. In contrast, most simulation-based datasets rely on simplified geometrical acoustics, thus failing to reproduce key physical phenomena like diffraction, scattering, and interference that govern sound propagation in complex environments. We introduce Treble10, a large-scale, physically accurate room-acoustic dataset. Treble10 contains over 3000 broadband room impulse responses (RIRs) simulated in 10 fully furnished real-world rooms, using a hybrid simulation paradigm implemented in the Treble SDK that combines a wave-based and geometrical acoustics solver. The dataset provides six complementary subsets, spanning mono, 8th-order Ambisonics, and 6-channel device RIRs, as well as pre-convolved reverberant speech scenes paired with LibriSpeech utterances. All signals are simulated at 32 kHz, accurately modelling low-frequency wave effects and high-frequency reflections. Treble10 bridges the realism gap between measurement and simulation, enabling reproducible, physically grounded evaluation and large-scale data augmentation for far-field speech tasks. The dataset is openly available via the Hugging Face Hub, and is intended as both a benchmark and a template for next-generation simulation-driven audio research.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited acoustic realism in simulated far-field speech datasets
Bridges the gap between measured and simulated room acoustics data
Provides physically accurate dataset for far-field speech processing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid wave-based and geometrical acoustics solver implemented in the Treble SDK
Over 3,000 broadband RIRs simulated in 10 fully furnished real-world rooms
Six complementary subsets spanning mono, 8th-order Ambisonics, and 6-channel device RIRs
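RIR datasets like this lend themselves to standard room-acoustic analysis; a common sanity check is estimating reverberation time via Schroeder backward integration. The sketch below is a generic implementation (numpy only, not from the paper), validated on a synthetic RIR constructed with a known T60 of 0.5 s; the function name and parameters are illustrative.

```python
import numpy as np

def estimate_rt60(rir, fs, db_start=-5.0, db_end=-25.0):
    """Estimate RT60 from an RIR via Schroeder backward integration.

    Fits a line to the energy decay curve between db_start and db_end
    (a T20-style fit) and extrapolates the slope to -60 dB.
    """
    energy = np.cumsum(rir[::-1] ** 2)[::-1]       # Schroeder integral
    edc_db = 10.0 * np.log10(energy / energy[0])   # energy decay curve in dB
    t = np.arange(rir.size) / fs
    mask = (edc_db <= db_start) & (edc_db >= db_end)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second (negative)
    return -60.0 / slope

fs = 32_000
t = np.arange(fs) / fs
rng = np.random.default_rng(1)
# Synthetic RIR: noise with exponential amplitude decay chosen so the
# energy drops by 60 dB at t = 0.5 s (ln(10**3) ≈ 6.9078)
target_t60 = 0.5
rir = rng.standard_normal(t.size) * np.exp(-6.9078 * t / target_t60)

print(round(estimate_rt60(rir, fs), 2))
```

Applied to the Treble10 RIR subsets, this kind of analysis makes it possible to characterise each room's decay behaviour directly from the released data.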