Universal Speech Enhancement with Regression and Generative Mamba

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing speech enhancement models generalize poorly across the seven distortion types and five languages covered by the Interspeech 2025 URGENT Challenge. Method: The paper proposes USEMamba, a Mamba-based state-space model (SSM) for universal speech enhancement with two variants, one regression-based and one generative. The model combines long-range sequence modeling with time-frequency structured processing and sampling-frequency-independent feature extraction. Regression-based modeling is the primary approach and handles most distortions; the generative variant is used for packet loss and bandwidth extension, where missing content must be inferred. Contribution/Results: In Track 1 of the Interspeech 2025 URGENT Challenge, the model placed second in the blind test phase despite being trained on only a subset of the training data, indicating strong generalization across diverse conditions.

📝 Abstract
The Interspeech 2025 URGENT Challenge aimed to advance universal, robust, and generalizable speech enhancement by unifying speech enhancement tasks across a wide variety of conditions, including seven different distortion types and five languages. We present Universal Speech Enhancement Mamba (USEMamba), a state-space speech enhancement model designed to handle long-range sequence modeling, time-frequency structured processing, and sampling frequency-independent feature extraction. Our approach primarily relies on regression-based modeling, which performs well across most distortions. However, for packet loss and bandwidth extension, where missing content must be inferred, a generative variant of the proposed USEMamba proves more effective. Despite being trained on only a subset of the full training data, USEMamba achieved 2nd place in Track 1 during the blind test phase, demonstrating strong generalization across diverse conditions.
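One common way to make time-frequency features independent of the sampling frequency is to fix the analysis window and hop in milliseconds rather than in samples. The sketch below illustrates that idea only; the 32 ms window, 16 ms hop, and the helper names `frame_params` and `num_frames` are assumptions for illustration and are not taken from the paper.

```python
def frame_params(sample_rate_hz, win_ms=32.0, hop_ms=16.0):
    """Return (window, hop) in samples for a fixed frame duration.

    win_ms and hop_ms are illustrative values, not the paper's settings.
    """
    win = int(round(sample_rate_hz * win_ms / 1000.0))
    hop = int(round(sample_rate_hz * hop_ms / 1000.0))
    return win, hop


def num_frames(num_samples, win, hop):
    """Number of full frames that fit in the signal (no padding)."""
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


if __name__ == "__main__":
    for sr in (8000, 16000, 48000):
        win, hop = frame_params(sr)
        frames = num_frames(2 * sr, win, hop)  # two seconds of audio
        bin_width_hz = sr / win                # spacing between frequency bins
        bins = win // 2 + 1                    # one-sided spectrum size
        print(sr, win, hop, frames, bin_width_hz, bins)
```

With fixed durations, the frame rate and the frequency-bin spacing stay the same at every sampling rate; only the number of frequency bins grows, so a model that treats frequency as a variable-length axis can process all rates with the same weights.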
Problem

Research questions and friction points this paper is trying to address.

Advance universal, robust speech enhancement across diverse conditions
Handle long-range sequence modeling and sampling-frequency-independent feature extraction
Improve performance for packet loss and bandwidth extension tasks
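To see why packet loss differs from additive noise, it helps to simulate it: the lost segments carry no information about the original signal, so the missing content has to be inferred from context. The sketch below is a hypothetical simulation, not the challenge's data pipeline; the 20 ms packet length and the function name `drop_packets` are assumptions.

```python
def drop_packets(samples, sample_rate_hz, lost_packets, packet_ms=20.0):
    """Zero out the given packet indices in a waveform.

    packet_ms is an illustrative packet duration, not a value from the paper.
    """
    packet_len = int(round(sample_rate_hz * packet_ms / 1000.0))
    out = list(samples)
    for p in lost_packets:
        start = p * packet_len
        end = min(start + packet_len, len(out))
        for i in range(start, end):
            out[i] = 0.0  # the original content is gone, not merely corrupted
    return out


if __name__ == "__main__":
    clean = [1.0] * 800                      # 100 ms of dummy audio at 8 kHz
    degraded = drop_packets(clean, 8000, lost_packets=[1, 3])
    print(sum(1 for v in degraded if v == 0.0), "samples lost")
```

Bandwidth limitation is similar in spirit: the high-frequency content is absent from the input rather than masked by noise.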
Innovation

Methods, ideas, or system contributions that make the work stand out.

State-space model for long-range sequence modeling
Regression-based modeling for most distortions
Generative variant for packet loss and bandwidth extension
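The state-space idea behind the first item can be sketched as a linear recurrence, h_t = a * h_{t-1} + b * x_t with output y_t = c * h_t. This is a minimal, non-selective diagonal SSM written for illustration. Mamba additionally makes the parameters depend on the input and uses a hardware-efficient scan, neither of which is shown here, and the function name `ssm_scan` is hypothetical.

```python
def ssm_scan(x, a, b, c):
    """Run a diagonal linear state-space model over a scalar sequence.

    x: input sequence (list of floats)
    a, b, c: per-state decay, input, and output weights (equal-length lists)
    Returns the output sequence y with the same length as x.
    """
    h = [0.0] * len(a)  # hidden state, one value per state dimension
    ys = []
    for x_t in x:
        # State update: each dimension decays by a_i and absorbs the input.
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        # Readout: weighted sum of the state dimensions.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 100
    fast = ssm_scan(impulse, a=[0.5], b=[1.0], c=[1.0])
    slow = ssm_scan(impulse, a=[0.99], b=[1.0], c=[1.0])
    # A decay close to 1 keeps the impulse visible far into the sequence.
    print(fast[100], slow[100])
```

The per-step cost is constant in the sequence length, which is what makes this family of models attractive for long audio sequences.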