🤖 AI Summary
Data-driven acoustic echo cancellation (AEC) models suffer from poor generalization to unseen real-room scenarios, primarily due to the unobservable echo path and the absence of transferable priors. To address this, this work introduces room impulse responses (RIRs) as explicit prompts into an end-to-end AEC framework—marking the first such integration. We propose four multimodal fusion mechanisms for jointly encoding RIRs and speech features, and adopt a joint evaluation strategy using both simulated and real-world RIRs. Our core contribution lies in leveraging RIR embeddings to guide the model in learning robust, generalizable echo-path priors. Experiments demonstrate substantial improvements: PESQ increases by over 1.2, and ERLE improves by an average of 4.8 dB across unseen simulated and real acoustic environments—significantly outperforming RIR-agnostic baseline models.
📝 Abstract
Data-driven acoustic echo cancellation (AEC) methods, predominantly trained on synthetic or constrained real-world datasets, suffer performance declines in unseen echo scenarios, especially in real environments where echo paths are not directly observable. Our proposed method counters this limitation by integrating room impulse responses (RIRs) as pivotal training prompts, aiming to improve the generalization of AEC models under such unforeseen conditions. We also explore four RIR prompt fusion methods. Comprehensive evaluations, covering both simulated RIRs under unknown conditions and RIRs recorded in real environments, demonstrate that the proposed approach significantly outperforms baseline models. These results substantiate the effectiveness of our RIR-guided approach in strengthening the model's generalization capability.
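The abstract does not detail the four fusion mechanisms, but the general idea of injecting an RIR embedding as a prompt alongside speech features can be sketched. Below is a minimal, hypothetical illustration (not the paper's actual architecture) of two common fusion styles: frame-wise concatenation and FiLM-style feature modulation. All shapes, the toy linear RIR encoder, and the variable names are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes (not from the paper):
# T frames, F speech-feature dims, R RIR taps, D embedding dims.
T, F, R, D = 100, 64, 512, 16

speech_feats = rng.standard_normal((T, F))  # per-frame speech features
rir = rng.standard_normal(R)                # one room impulse response

# Toy RIR encoder: a fixed linear projection standing in for a learned network.
W_enc = rng.standard_normal((R, D)) / np.sqrt(R)
rir_emb = np.tanh(rir @ W_enc)              # (D,) RIR "prompt" embedding

# Fusion style 1: concatenation -- tile the embedding across every frame.
concat_fused = np.concatenate(
    [speech_feats, np.tile(rir_emb, (T, 1))], axis=1)   # shape (T, F + D)

# Fusion style 2: FiLM-style modulation -- the embedding predicts a
# per-feature scale (gamma) and shift (beta) applied to the speech features.
W_gamma = rng.standard_normal((D, F)) / np.sqrt(D)
W_beta = rng.standard_normal((D, F)) / np.sqrt(D)
gamma, beta = rir_emb @ W_gamma, rir_emb @ W_beta
film_fused = gamma * speech_feats + beta                # shape (T, F)

print(concat_fused.shape, film_fused.shape)
```

Either fused representation would then feed the downstream AEC network; the appeal of modulation-style fusion is that it conditions every layer's features on the echo path without growing the feature dimension.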