Masked Autoencoders as Universal Speech Enhancer

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the first general-purpose speech enhancement framework based on a masked autoencoder, addressing two practical obstacles: the scarcity of clean speech labels and the limited generalization of existing supervised methods. A self-supervised pre-training stage trains the model to reconstruct masked spectrogram regions while removing diverse synthetic distortions; the pre-trained embeddings are then fine-tuned on a small amount of paired data for specific downstream tasks. Adapting the masked autoencoder to speech enhancement lets a single model handle multiple distortion types jointly. Robustness is further improved through log1p-compressed spectrogram inputs and tailored data augmentation strategies. Experiments show state-of-the-art performance on both denoising and dereverberation, outperforming current methods on in-domain and out-of-domain evaluation datasets.
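The log1p compression mentioned above can be illustrated with a minimal sketch (my own illustration, not the paper's code; the function names and the assumption that spectrograms arrive as NumPy magnitude arrays are mine):

```python
import numpy as np

def log1p_compress(mag_spec):
    """Compress a non-negative magnitude spectrogram with log(1 + x).

    This tames the large dynamic range of speech spectrograms while
    keeping the mapping invertible and well-behaved near zero.
    """
    return np.log1p(mag_spec)

def log1p_expand(compressed_spec):
    """Invert log1p compression via exp(x) - 1."""
    return np.expm1(compressed_spec)

# Round trip: compression followed by expansion recovers the input.
mag = np.abs(np.random.default_rng(0).normal(size=(257, 100)))
recovered = log1p_expand(log1p_compress(mag))
```

The round trip is exact up to floating-point error, which is why such a compression can be applied to model inputs without losing information.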

📝 Abstract
Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like $log1p$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
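The masked-reconstruction objective described in the abstract can be sketched as random patch masking over a spectrogram, in the style of masked autoencoders (a minimal sketch of the general technique, not the paper's implementation; the patch size and mask ratio here are assumed values):

```python
import numpy as np

def mask_spectrogram(spec, patch=(16, 16), mask_ratio=0.75, rng=None):
    """Zero out a random subset of non-overlapping patches of `spec`.

    During MAE-style pre-training the model would be asked to
    reconstruct the zeroed regions (and, in this paper's setting,
    simultaneously remove added distortions). Returns the masked
    spectrogram and the boolean patch-level mask (True = masked).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_f, n_t = spec.shape[0] // patch[0], spec.shape[1] // patch[1]
    n_patches = n_f * n_t
    n_mask = int(n_patches * mask_ratio)

    # Choose which patches to hide.
    flat_mask = np.zeros(n_patches, dtype=bool)
    flat_mask[rng.permutation(n_patches)[:n_mask]] = True
    patch_mask = flat_mask.reshape(n_f, n_t)

    out = spec.copy()
    for i in range(n_f):
        for j in range(n_t):
            if patch_mask[i, j]:
                out[i * patch[0]:(i + 1) * patch[0],
                    j * patch[1]:(j + 1) * patch[1]] = 0.0
    return out, patch_mask

# Example: a 64x64 spectrogram split into 4x4 = 16 patches,
# of which 75% (12 patches) are masked.
spec = np.ones((64, 64), dtype=np.float32)
masked, patch_mask = mask_spectrogram(spec)
```

In an actual pre-training loop, the reconstruction loss would be computed only (or primarily) on the masked patches, so the encoder must infer the hidden speech content from the visible context.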
Problem

Research questions and friction points this paper is trying to address.

speech enhancement
self-supervised learning
masked autoencoder
distortion-agnostic
downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Autoencoder
Self-supervised Learning
Universal Speech Enhancement
Data Augmentation
Pre-trained Embeddings