🤖 AI Summary
This work proposes a fully neural, end-to-end streaming acoustic echo cancellation (AEC) approach that eliminates reliance on traditional linear models and explicit time-delay estimation, which often struggle to balance low latency and high performance in complex acoustic environments. In place of these conventional linear components, the method combines a progressive learning strategy, knowledge transfer from a pretrained linear AEC (LAEC)-based model, a loss on attention weights that enforces time alignment between the reference and microphone signals, and voice activity detection (VAD)-guided output masking to improve echo suppression and speech quality. Experiments on public datasets show that the proposed system outperforms existing methods while remaining capable of real-time, streaming inference.
📝 Abstract
We propose a novel neural network-based end-to-end acoustic echo cancellation (E2E-AEC) method capable of streaming inference, which operates effectively without reliance on traditional linear AEC (LAEC) techniques or time delay estimation. Our approach includes several key strategies. First, we introduce and refine progressive learning to gradually enhance echo suppression. Second, our model employs knowledge transfer by initializing with a pre-trained LAEC-based model, harnessing the insights gained from LAEC training. Third, we optimize the attention mechanism with a loss function applied to the attention weights to achieve precise time alignment between the reference and microphone signals. Lastly, we incorporate voice activity detection (VAD) to enhance speech quality and improve echo removal by masking the network output when near-end speech is absent. The effectiveness of our approach is validated through experiments conducted on public datasets.
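The two post-hoc mechanisms named above (the alignment loss on attention weights and VAD-guided output masking) can be illustrated with a minimal numpy sketch. This is only an assumed formulation, not the paper's actual implementation: we hypothesize the alignment loss as a cross-entropy that pushes each microphone frame's attention distribution toward the reference frame at the true echo delay, and the VAD mask as a simple hard gate on frames where near-end speech probability is low. All function names, shapes, and the threshold value are illustrative.

```python
import numpy as np

def attention_alignment_loss(attn_weights, true_delay_frames):
    """Hypothetical alignment loss: cross-entropy between each microphone
    frame's attention distribution over reference frames and a one-hot
    target placed at the true echo delay.

    attn_weights: (T, R) array, each row sums to 1 over R reference frames.
    true_delay_frames: integer delay of the echo path, in frames.
    """
    T, R = attn_weights.shape
    eps = 1e-8  # numerical floor to keep log() finite
    loss = 0.0
    for t in range(T):
        # The reference frame that should receive the attention mass,
        # clipped to the valid index range.
        target = min(max(t - true_delay_frames, 0), R - 1)
        loss += -np.log(attn_weights[t, target] + eps)
    return loss / T

def vad_masked_output(enhanced, vad_probs, threshold=0.5):
    """VAD-guided output masking: zero the enhanced signal in frames
    where near-end speech is judged absent (hard 0/1 gate; the paper's
    actual masking rule may differ).

    enhanced: (T, F) enhanced spectrum or features.
    vad_probs: (T,) per-frame near-end speech probabilities.
    """
    mask = (vad_probs >= threshold).astype(enhanced.dtype)
    return enhanced * mask[:, None]
```

In this sketch, a perfectly peaked attention row at the delayed reference frame drives the loss toward zero, while uniform attention yields `log(R)` per frame; the VAD gate simply suppresses residual echo in near-end-silent frames at the cost of relying on VAD accuracy.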