🤖 AI Summary
Real-world speech is often degraded by noise and reverberation; existing generative enhancement methods frequently introduce phonemic hallucinations and inconsistencies in speaker identity or paralinguistic attributes. To address these challenges, we propose the first latent-space diffusion Transformer framework specifically designed for realistically degraded speech, innovatively integrating robust conditional feature encoding with multi-scale time-frequency joint modeling to achieve high-fidelity full-bandwidth reconstruction. Evaluated on the DAPS dataset, our method achieves studio-quality audio fidelity—matching that of professionally recorded clean speech—for the first time. It attains state-of-the-art performance both objectively and subjectively, with a 12.3% improvement in speaker verification accuracy and a 37.6% reduction in phonemic hallucination rate.
📝 Abstract
Real-world speech recordings suffer from degradations such as background noise and reverberation. Speech enhancement aims to mitigate these issues by generating clean high-fidelity signals. While recent generative approaches for speech enhancement have shown promising results, they still face two major challenges: (1) content hallucination, where plausible phonemes generated differ from the original utterance; and (2) inconsistency, failing to preserve speaker's identity and paralinguistic features from the input speech. In this work, we introduce DiTSE (Diffusion Transformer for Speech Enhancement), which addresses quality issues of degraded speech in full bandwidth. Our approach employs a latent diffusion transformer model together with robust conditioning features, effectively addressing these challenges while remaining computationally efficient. Experimental results from both subjective and objective evaluations demonstrate that DiTSE achieves state-of-the-art audio quality that, for the first time, matches real studio-quality audio from the DAPS dataset. Furthermore, DiTSE significantly improves the preservation of speaker identity and content fidelity, reducing hallucinations across datasets compared to state-of-the-art enhancers. Audio samples are available at: http://hguimaraes.me/DiTSE