🤖 AI Summary
Problem: Existing speech enhancement models generalize poorly across the diverse conditions targeted by the Interspeech 2025 URGENT Challenge, which spans seven distortion types and five languages. Method: The paper proposes USEMamba (Universal Speech Enhancement Mamba), a state-space model (SSM) for unified speech enhancement that combines long-range sequence modeling, time-frequency structured processing, and sampling-frequency-independent feature extraction. The approach is primarily regression-based, which performs well across most distortions; a generative variant is used for packet loss concealment and bandwidth extension, where missing content must be inferred. Contribution/Results: In the blind test phase of Interspeech 2025 URGENT Challenge Track 1, the model placed 2nd despite being trained on only a subset of the full training data, demonstrating strong generalization across diverse conditions.
📝 Abstract
The Interspeech 2025 URGENT Challenge aimed to advance universal, robust, and generalizable speech enhancement by unifying speech enhancement tasks across a wide variety of conditions, including seven different distortion types and five languages. We present Universal Speech Enhancement Mamba (USEMamba), a state-space speech enhancement model designed to handle long-range sequence modeling, time-frequency structured processing, and sampling-frequency-independent feature extraction. Our approach primarily relies on regression-based modeling, which performs well across most distortions. However, for packet loss and bandwidth extension, where missing content must be inferred, a generative variant of the proposed USEMamba proves more effective. Despite being trained on only a subset of the full training data, USEMamba achieved 2nd place in Track 1 during the blind test phase, demonstrating strong generalization across diverse conditions.
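The abstract does not say how USEMamba makes its feature extraction independent of the sampling frequency. One common approach in universal speech enhancement, sketched here purely as an assumption and not as the authors' method, is to fix the STFT window and hop in milliseconds rather than in samples. The frame rate and the width of each frequency bin then stay constant across sampling rates, and only the number of bins changes. The function names `sfi_stft_params` and `sfi_stft` and the 20 ms / 10 ms defaults are hypothetical.

```python
import numpy as np


def sfi_stft_params(sample_rate, win_ms=20.0, hop_ms=10.0):
    """Convert a fixed window/hop duration (ms) into sample counts.

    Assumption for illustration: durations are fixed in time, so the
    window and hop lengths in samples scale with the sampling rate.
    """
    win_length = int(round(sample_rate * win_ms / 1000.0))
    hop_length = int(round(sample_rate * hop_ms / 1000.0))
    return win_length, hop_length


def sfi_stft(signal, sample_rate, win_ms=20.0, hop_ms=10.0):
    """Compute a complex STFT whose frame rate does not depend on sample_rate.

    Returns an array of shape (n_frames, win_length // 2 + 1).
    """
    signal = np.asarray(signal, dtype=np.float64)
    win_length, hop_length = sfi_stft_params(sample_rate, win_ms, hop_ms)
    if len(signal) < win_length:
        raise ValueError("signal is shorter than one analysis window")

    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length

    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([
        signal[i * hop_length: i * hop_length + win_length] * window
        for i in range(n_frames)
    ])

    # One-sided FFT: bin spacing is sample_rate / win_length = 1000 / win_ms Hz.
    return np.fft.rfft(frames, axis=-1)


# One second of audio at two sampling rates yields the same number of frames;
# only the number of frequency bins differs.
spec_8k = sfi_stft(np.zeros(8000), 8000)
spec_16k = sfi_stft(np.zeros(16000), 16000)
print(spec_8k.shape, spec_16k.shape)
```

With this layout a model that processes the time axis sees the same sequence length for a given duration regardless of input rate, while the frequency axis grows with bandwidth; how a model such as USEMamba actually handles the varying number of bins is not described in the abstract.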