🤖 AI Summary
This study addresses the limited ecological validity of synthetically generated room impulse responses (RIRs) in monaural speech enhancement. It compares four strategies: three extensions to image-source method (ISM) shoebox RIRs, namely frequency-dependent multi-band absorption coefficients, source directivity, and microphone directivity, plus mesh-based RIRs from the SoundSpaces dataset. A DeepFilterNet3 model is trained on each RIR dataset and evaluated, objectively and subjectively, on a test set of real RIRs. Replacing the conventional single-band absorption assumption with frequency-dependent coefficients (multi-band RIRs, MB-RIRs) yields a 0.51 dB improvement in signal-to-distortion ratio (SDR) and an 8.9-point gain in MUSHRA subjective listening scores over the baseline. The MB-RIR dataset is publicly available under an open-source, royalty-free license.
📝 Abstract
We investigate the effects of four strategies for improving the ecological validity of synthetic room impulse response (RIR) datasets for monaural speech enhancement (SE). We implement three features on top of traditional image source method (ISM)-based shoebox RIRs: multiband absorption coefficients, source directivity, and receiver directivity. We additionally consider mesh-based RIRs from the SoundSpaces dataset. We then train a DeepFilterNet3 model for each RIR dataset and evaluate its performance on a test set of real RIRs, both objectively and subjectively. We find that RIRs which use frequency-dependent acoustic absorption coefficients (MB-RIRs) obtain +0.51 dB of SDR and +8.9 MUSHRA points when evaluated on real RIRs. The MB-RIRs dataset is publicly available for free download.
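To illustrate why frequency-dependent absorption matters, the following is a minimal sketch, not the authors' implementation. It assumes a shoebox room with one absorption coefficient per frequency band applied to all six walls, and uses Sabine's formula to show how the reverberation time then differs per band. The function names, the band centres, and the coefficient values are hypothetical; the abstract does not specify them.

```python
import math


def sabine_rt60_per_band(room_dims, absorption_per_band):
    """Estimate RT60 (seconds) for each frequency band with Sabine's formula.

    room_dims: (length, width, height) of a shoebox room, in metres.
    absorption_per_band: dict mapping a band centre frequency (Hz) to one
        energy absorption coefficient in (0, 1], assumed identical for all
        six walls (an illustrative simplification).
    """
    lx, ly, lz = room_dims
    volume = lx * ly * lz
    surface = 2.0 * (lx * ly + lx * lz + ly * lz)

    rt60 = {}
    for freq, alpha in absorption_per_band.items():
        if not 0.0 < alpha <= 1.0:
            raise ValueError(f"absorption at {freq} Hz must be in (0, 1]")
        # Sabine: RT60 = 0.161 * V / (S * alpha)
        rt60[freq] = 0.161 * volume / (surface * alpha)
    return rt60


def reflection_coefficient(alpha):
    """Amplitude reflection coefficient for an energy absorption coefficient."""
    return math.sqrt(1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical coefficients: walls absorb more at high frequencies.
    bands = {125: 0.10, 500: 0.15, 2000: 0.25, 8000: 0.40}
    for freq, t in sabine_rt60_per_band((5.0, 4.0, 3.0), bands).items():
        print(f"{freq:>5} Hz: RT60 ~ {t:.2f} s")
```

A single-band model would collapse the dictionary to one coefficient and therefore to one decay time for the whole spectrum, whereas the multiband variant lets low frequencies ring longer than high ones, as is typical of real rooms.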