🤖 AI Summary
Existing blind estimation methods generalize poorly across diverse acoustic environments under realistic noise conditions, and they typically estimate only a limited subset of room acoustic parameters (RAPs) or room physical parameters (RPPs). As a result, they fail to jointly infer critical quantities such as reverberation time (RT60), source distance, source azimuth, and room occupancy level. To address this, we propose a Sparse Stochastic Impulse Response (SSIR) model and a framework that pairs a unified encoder with multiple prediction branches, operating on single-channel noisy speech. Our approach enables the first end-to-end joint blind estimation of both RAPs and RPPs, requiring neither clean speech nor prior knowledge of the room. The predicted SSIR parameters are used to synthesize room impulse responses (RIRs), from which the RAPs are derived. Evaluated on a newly constructed benchmark dataset compiled from publicly available datasets, our method achieves state-of-the-art performance. The implementation is publicly available.
📝 Abstract
Room acoustic parameters (RAPs) and room physical parameters (RPPs) are essential metrics for parameterizing the room acoustical characteristics (RACs) of the sound field in a listener's local environment, and they inform a wide range of applications. Current RAP and RPP estimation methods either fail to cover the breadth of real-world acoustic environments with real background noise or lack a universal framework for blindly estimating RAPs and RPPs from noisy single-channel speech signals, particularly sound source distance, direction of arrival (DOA) of the sound source, and occupancy level. In this paper, we propose a new universal blind estimation framework, the blind estimator of room acoustical and physical parameters (BERP). It introduces a new stochastic room impulse response (RIR) model, the sparse stochastic impulse response (SSIR) model, and equips BERP with a unified encoder and multiple separate predictors that estimate the RPPs and the SSIR parameters in parallel. This framework enables computationally efficient, universal estimation of room parameters using only noisy single-channel speech signals. All RAPs can then be derived simultaneously from RIRs synthesized by the SSIR model with the estimated parameters. To evaluate the proposed BERP and SSIR models, we compile a task-specific dataset from several publicly available datasets. The results show that BERP achieves state-of-the-art (SOTA) performance, and the evaluation of the SSIR RIR model also demonstrates its efficacy. The code is available on GitHub.
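As an illustration of the final step, deriving a RAP from a synthesized RIR, the following minimal Python sketch estimates RT60 from an impulse response using Schroeder backward integration. It is not the BERP implementation. The RIR here is a generic exponentially decaying Gaussian noise sequence used as a stand-in for an SSIR-synthesized response, and the function names, sampling rate, and fitting range (-5 to -25 dB, extrapolated to -60 dB) are assumptions chosen for the example.

```python
import math
import random


def synth_decaying_noise_rir(rt60_s, fs=8000, duration_s=1.0, seed=0):
    """Hypothetical stand-in RIR: Gaussian noise under an exponential envelope.

    The amplitude decay rate 3*ln(10)/rt60 makes the energy envelope
    drop by 60 dB at t = rt60_s. This is NOT the SSIR model.
    """
    rng = random.Random(seed)
    decay = 3.0 * math.log(10.0) / rt60_s
    n = int(fs * duration_s)
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * i / fs) for i in range(n)]


def schroeder_edc_db(rir):
    """Energy decay curve (EDC) by backward integration, in dB re total energy."""
    edc = []
    acc = 0.0
    for x in reversed(rir):
        acc += x * x  # running sum of squared samples from the tail
        edc.append(acc)
    edc.reverse()
    total = edc[0]
    # Zeros can only occur at the tail, so dropping them keeps indices aligned.
    return [10.0 * math.log10(e / total) for e in edc if e > 0.0]


def rt60_from_edc(edc_db, fs, lo_db=-5.0, hi_db=-25.0):
    """Least-squares line fit on the EDC between lo_db and hi_db,
    extrapolated to a 60 dB decay (a T20-style estimate)."""
    pts = [(i / fs, d) for i, d in enumerate(edc_db) if hi_db <= d <= lo_db]
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_d = sum(d for _, d in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in pts)
    slope = sxy / sxx  # dB per second, negative for a decaying response
    return -60.0 / slope


# Usage: synthesize a stand-in RIR with a nominal RT60 of 0.5 s and estimate it back.
rir = synth_decaying_noise_rir(rt60_s=0.5)
estimate = rt60_from_edc(schroeder_edc_db(rir), fs=8000)
print(f"estimated RT60: {estimate:.2f} s")
```

Other RAPs that are defined on the RIR could be computed from the same synthesized response in a similar way, which is why estimating the RIR model parameters once is enough to obtain all of them.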