AI Summary
This work addresses the limited generalization of current speech enhancement models, which are typically trained on scarce data under single degradation conditions and thus struggle in real-world scenarios involving complex, unseen distortions such as noise, reverberation, and compression. To overcome this, the authors propose DiT-Flow, a novel framework that, for the first time, integrates flow matching with Diffusion Transformers (DiT) in the latent space of a variational autoencoder. The approach further incorporates a lightweight fusion architecture combining Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE). Remarkably, with only 4.9% of its parameters trainable, DiT-Flow significantly outperforms state-of-the-art generative speech enhancement models on the StillSonicSet dataset, demonstrating strong generalization across five unseen distortion types and robust multi-condition performance.
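The paper does not give the exact form of the LoRA/MoE fusion, but the general pattern it names is well established: freeze the base weight matrix and learn several low-rank adapters whose outputs are mixed by a learned gate. The sketch below is a minimal, hypothetical illustration of that pattern (the dimensions, gating function, and scaling are assumptions, not the authors' design); with the standard zero-initialization of the LoRA up-projections, the adapted layer initially matches the frozen base layer exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 2, 4  # hypothetical sizes, not taken from the paper

W = rng.normal(size=(d, d))                     # frozen base weight
A = rng.normal(size=(n_experts, r, d)) * 0.01   # LoRA down-projections (trainable)
B = np.zeros((n_experts, d, r))                 # LoRA up-projections, zero-init (trainable)
gate_W = rng.normal(size=(d, n_experts))        # gating network (trainable)

def moe_lora_forward(x, alpha=4.0):
    """Frozen base layer plus a softmax-gated mixture of LoRA experts."""
    base = x @ W.T
    logits = x @ gate_W
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    delta = np.zeros_like(base)
    for e in range(n_experts):
        # each expert contributes a rank-r update, weighted by its gate value
        delta += gates[:, e:e + 1] * (x @ A[e].T @ B[e].T) * (alpha / r)
    return base + delta
```

Because `B` starts at zero, `moe_lora_forward(x)` equals `x @ W.T` before training, so fine-tuning starts from the pretrained model's behavior while only the small `A`, `B`, and gate matrices receive gradients.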
Abstract
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from a variational autoencoder (VAE). We validate our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to improve the realism of synthetic data, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating Low-Rank Adaptation (LoRA) with a Mixture-of-Experts (MoE) framework, we achieve parameter-efficient, high-performance training: DiT-Flow remains robust to multiple distortions while updating only 4.9% of the total parameters, and achieves better performance on five unseen distortions.
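The abstract's core recipe, flow matching in a VAE latent space, can be sketched in a few lines. In the common linear (rectified-flow-style) formulation, a training example is a point on the straight path between a prior sample and the clean latent, and the network regresses the constant velocity along that path; this is a generic illustration, not the authors' exact parameterization, and the latent dimension here is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_training_pair(z_clean, z_prior, t):
    """Linear probability path in latent space.

    At time t the sample is the interpolation between the prior draw and
    the clean latent; the regression target is the constant velocity
    z_clean - z_prior that carries z_prior to z_clean in unit time.
    """
    z_t = (1.0 - t) * z_prior + t * z_clean
    v_target = z_clean - z_prior
    return z_t, v_target

# hypothetical 8-dim latents: a clean-speech VAE latent and a Gaussian prior draw
z_clean = rng.normal(size=(8,))
z_prior = rng.normal(size=(8,))
z_half, v = fm_training_pair(z_clean, z_prior, t=0.5)
```

At inference one would integrate the learned velocity field from the prior sample (here, with a perfect velocity, a single Euler step `z_prior + 1.0 * v` already lands exactly on `z_clean`), and the VAE decoder then maps the resulting latent back to a waveform or spectrogram.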