🤖 AI Summary
Existing dereverberation methods rely heavily on scarce clean–reverberant speech pairs, or generalize poorly because their target metrics lack cross-metric consistency, degrading performance on other evaluation criteria. This paper proposes a weakly supervised training paradigm that requires only reverberant speech and coarse RT60 estimates—no paired data. The core contribution is a physics-informed, generative reverberation-matching loss built on synthesized room impulse responses (RIRs): the RT60 estimate guides RIR synthesis, and a waveform-level reconstruction error against the reverberant input is optimized. By embedding acoustic prior knowledge into the neural model, the method improves robustness and generalization. Extensive experiments show consistent superiority over state-of-the-art approaches across multiple objective metrics—including PESQ, STOI, and ESTOI—and, crucially, the method maintains substantial gains under non-target evaluation metrics, supporting its broad applicability and reliability.
📝 Abstract
This paper introduces a new training strategy that improves speech dereverberation systems using only minimal acoustic information and reverberant (wet) speech. Most existing algorithms rely on paired dry/wet data, which is difficult to obtain, or on target metrics that may not adequately capture reverberation characteristics and can lead to poor results on non-target metrics. Our approach uses limited acoustic information, such as the reverberation time (RT60), to train a dereverberation system. The system's output is resynthesized using a generated room impulse response and compared with the original reverberant speech, yielding a novel reverberation-matching loss that replaces standard target-metric objectives. During inference, only the trained dereverberation model is used. Experimental results demonstrate that our method achieves more consistent performance across the objective metrics used in speech dereverberation than the state of the art.
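To make the training signal concrete, here is a minimal NumPy sketch of a reverberation-matching loss of the kind the abstract describes. It is an illustration, not the paper's implementation: the RIR generator (`synth_rir`, exponentially decaying noise shaped so energy falls 60 dB over RT60 seconds) and the plain MSE reconstruction error are simplifying assumptions made for this example.

```python
import numpy as np

def synth_rir(rt60, fs=16000, seed=0):
    """Hypothetical RIR generator: Gaussian noise with an exponential
    envelope whose decay reaches -60 dB at t = rt60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    envelope = 10.0 ** (-3.0 * t / rt60)   # -60 dB when t == rt60
    rir = rng.standard_normal(n) * envelope
    return rir / np.max(np.abs(rir))

def reverb_matching_loss(dry_est, wet, rt60, fs=16000):
    """Re-reverberate the model's dry estimate with a synthesized RIR
    and measure waveform-level reconstruction error vs. the wet input."""
    rir = synth_rir(rt60, fs)
    resynth = np.convolve(dry_est, rir)[: len(wet)]
    return float(np.mean((resynth - wet) ** 2))

# Toy usage: wet speech built by convolving a known dry signal
# with the same family of RIRs used inside the loss.
fs = 16000
dry = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
wet = np.convolve(dry, synth_rir(0.3, fs))[: len(dry)]

loss_dry = reverb_matching_loss(dry, wet, 0.3, fs)               # low: correct estimate
loss_zero = reverb_matching_loss(np.zeros_like(dry), wet, 0.3, fs)  # high: silent estimate
```

Note that no clean reference appears in the loss: only the wet signal and an RT60 estimate are needed, which is what makes the paradigm weakly supervised. In practice the RIR synthesis would need to be differentiable so gradients can flow to the dereverberation network.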