🤖 AI Summary
Problem: Joint modeling of acoustic echo cancellation (AEC) and deep noise suppression (DNS) across heterogeneous edge-to-cloud deployment scenarios poses significant challenges in balancing model capacity, computational efficiency, and hardware adaptability.
Method: This paper proposes SMRU, a lightweight and scalable architecture featuring a novel frequency-band splitting–fusion mechanism and a variable-frame-rate recurrent U-Net. Key innovations include multi-scale band splitting/merging layers, causal temporal down-/up-sampling, dual-path modeling, and a recurrence-centric U-Net design enabling dynamic computational adaptation—from 50 M MACs/s to 6.8 G MACs/s.
Contribution/Results: SMRU achieves competitive or better performance than existing baselines on DNSMOS, PESQ, and STOI metrics across its range of complexity configurations. It is the first AEC-DNS model to support cross-device complexity-adaptive deployment, enhancing the universality and practicality of speech enhancement systems on heterogeneous hardware.
📝 Abstract
The proliferation of deep neural networks has spurred rapid progress in acoustic echo cancellation and noise suppression, and many prior methods with promising performance have been proposed. Nevertheless, they rarely consider deployment generality across different processing scenarios, such as edge devices and cloud processing. To this end, this paper proposes a general model, termed SMRU, to cover different application scenarios. The novelty is two-fold. First, a multi-scale band split layer and a band merge layer are proposed to effectively fuse local frequency bands for lower-complexity modeling. Second, by emulating the multi-resolution feature modeling of the classical UNet structure, a novel recurrent-dominated UNet is devised. It consists of multiple variable-frame-rate blocks, each of which comprises causal time down-/upsampling layers with varying compression ratios and a dual-path structure for inter- and intra-band modeling. The model is configured from $50\,\mathrm{M}/\mathrm{s}$ to $6.8\,\mathrm{G}/\mathrm{s}$ in terms of MACs, and experimental results show that the proposed approach yields competitive or even better performance than existing baselines and has the potential to adapt to more general scenarios with varying complexity requirements.
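To make the band split/merge and causal time downsampling ideas concrete, here is a minimal NumPy sketch. It is not the SMRU implementation: the paper's layers are learned, whereas this toy version uses plain averaging and repetition. The function names (`band_edges`, `band_split`, `band_merge`, `causal_downsample`) and the band widths are assumptions chosen purely for illustration.

```python
import numpy as np

def band_edges(n_bins, widths):
    """Return (start, stop) bin indices for consecutive bands.
    `widths` is a hypothetical multi-scale layout, e.g. narrow bands at
    low frequencies and wider bands at high frequencies."""
    assert sum(widths) == n_bins, "band widths must cover all bins"
    edges, start = [], 0
    for w in widths:
        edges.append((start, start + w))
        start += w
    return edges

def band_split(spec, edges):
    """Compress each band to one value per frame by averaging its bins.
    spec: (frames, n_bins) -> (frames, n_bands).
    Averaging is a stand-in for a learned projection (assumption)."""
    return np.stack([spec[:, a:b].mean(axis=1) for a, b in edges], axis=1)

def band_merge(bands, edges, n_bins):
    """Expand band features back to full frequency resolution by
    repeating each band value across the bins of that band."""
    out = np.empty((bands.shape[0], n_bins), dtype=bands.dtype)
    for k, (a, b) in enumerate(edges):
        out[:, a:b] = bands[:, k:k + 1]
    return out

def causal_downsample(x, ratio):
    """Reduce the frame rate by `ratio` without looking ahead.
    Output frame k averages input frames k*ratio-(ratio-1) .. k*ratio,
    with zero padding before the first frame."""
    pad = np.zeros((ratio - 1, x.shape[1]), dtype=x.dtype)
    padded = np.concatenate([pad, x], axis=0)
    n_out = (x.shape[0] + ratio - 1) // ratio  # ceil(frames / ratio)
    return np.stack(
        [padded[k * ratio:k * ratio + ratio].mean(axis=0) for k in range(n_out)],
        axis=0,
    )

# Toy input: 2 frames, 8 frequency bins.
spec = np.arange(16, dtype=float).reshape(2, 8)
edges = band_edges(8, [1, 1, 2, 4])
bands = band_split(spec, edges)       # shape (2, 4)
merged = band_merge(bands, edges, 8)  # shape (2, 8)

# Toy input: 8 frames, 1 feature, frame rate halved.
frames = np.arange(8, dtype=float).reshape(8, 1)
slow = causal_downsample(frames, 2)   # shape (4, 1)
```

Splitting 8 bins into 4 bands halves the width of the feature that any later recurrent layer would process, and halving the frame rate halves how often that layer runs. This is the general kind of trade-off that lets a single architecture be configured for different MAC budgets.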