🤖 AI Summary
To address the low intelligibility and loss of speaker identity inherent in dysarthric speech, this paper proposes the first speech reconstruction framework based on Latent Diffusion Models (LDMs). Methodologically, the authors adapt LDMs to jointly model phoneme-level content restoration and speaker identity preservation: robust phoneme embeddings are extracted with self-supervised learning (SSL) models (e.g., wav2vec 2.0) and refined by dedicated phoneme and speaker identity encoders, while an in-context learning mechanism enables identity-conditioned generation. Experiments on the UASpeech corpus demonstrate significant improvements: an 18.7% reduction in Word Error Rate (WER) and a 23.5% increase in speaker similarity (measured by cosine similarity), establishing a novel paradigm for dysarthric speech enhancement.
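To make the encoding pipeline concrete, here is a minimal PyTorch sketch of the content and speaker encoding stages described above. All module names, dimensions, the Transformer-based content encoder, and the mean-pooling speaker encoder are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of the two conditioning branches: a content encoder
# that restores phoneme embeddings from SSL features, and a speaker encoder
# that pools a reference utterance into an identity vector.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps SSL features (e.g., frozen wav2vec 2.0 hidden states) to
    restored phoneme embeddings. A Transformer encoder is one plausible
    choice; the paper's actual design may differ."""
    def __init__(self, ssl_dim=768, phoneme_dim=256):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, phoneme_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=phoneme_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, ssl_feats):                    # (B, T, ssl_dim)
        return self.encoder(self.proj(ssl_feats))    # (B, T, phoneme_dim)

class SpeakerEncoder(nn.Module):
    """Pools a reference utterance into a fixed speaker identity vector
    (mean pooling here is a simplifying assumption)."""
    def __init__(self, ssl_dim=768, spk_dim=256):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, spk_dim)

    def forward(self, ref_feats):                    # (B, T, ssl_dim)
        return self.proj(ref_feats).mean(dim=1)      # (B, spk_dim)

# Usage: in practice the features would come from a frozen SSL model
# such as wav2vec 2.0; random tensors stand in here.
ssl_feats = torch.randn(2, 100, 768)   # dysarthric utterance features
ref_feats = torch.randn(2, 80, 768)    # reference utterance features
content = ContentEncoder()(ssl_feats)  # restored phoneme embeddings
speaker = SpeakerEncoder()(ref_feats)  # preserved speaker identity
print(content.shape, speaker.shape)
```

These two outputs correspond to the two conditioning signals the summary names: phoneme-level content and speaker identity, which the diffusion generator then consumes jointly.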
📝 Abstract
Dysarthric speech reconstruction (DSR) aims to convert dysarthric speech into comprehensible speech while maintaining the speaker's identity. Despite significant advancements, existing methods often struggle with low speech intelligibility and poor speaker similarity. In this study, we introduce a novel diffusion-based DSR system that leverages a latent diffusion model to enhance the quality of speech reconstruction. Our model comprises: (i) a speech content encoder that restores phoneme embeddings via pre-trained self-supervised learning (SSL) speech foundation models; (ii) a speaker identity encoder that preserves speaker-aware identity via an in-context learning mechanism; (iii) a diffusion-based speech generator that reconstructs speech from the restored phoneme embeddings and the preserved speaker identity. In evaluations on the widely used UASpeech corpus, our proposed model shows notable improvements in speech intelligibility and speaker similarity.
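Since the generator in component (iii) is the diffusion stage, a hedged sketch of one training step may help: a denoiser predicts the noise added to latent frames, conditioned on the restored phoneme embeddings and the speaker vector. The latent dimensionality, the linear beta schedule, and the simple MLP denoiser are all assumptions for illustration under a standard DDPM objective, not the paper's actual architecture:

```python
# Hypothetical one-step DDPM training loop for a conditional speech
# denoiser; shapes and schedule are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Denoiser(nn.Module):
    """Predicts the noise in a latent frame sequence, conditioned on
    phoneme-content embeddings, a speaker vector, and the timestep."""
    def __init__(self, latent_dim=80, cond_dim=256, spk_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + spk_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, t, content, speaker):
        # z_t: (B, T, latent); content: (B, T, cond); speaker: (B, spk)
        B, T, _ = z_t.shape
        t_emb = t.float().view(B, 1, 1).expand(B, T, 1) / 1000.0
        spk = speaker.unsqueeze(1).expand(B, T, -1)
        return self.net(torch.cat([z_t, content, spk, t_emb], dim=-1))

# Linear beta schedule (assumed) and its cumulative alpha products.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

z0 = torch.randn(2, 100, 80)        # clean latent / mel target
content = torch.randn(2, 100, 256)  # from the content encoder
speaker = torch.randn(2, 256)       # from the speaker identity encoder
t = torch.randint(0, 1000, (2,))    # random diffusion timesteps
noise = torch.randn_like(z0)

# Forward-diffuse the clean latent, then regress the injected noise.
a = alphas_cumprod[t].view(-1, 1, 1)
z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise
loss = F.mse_loss(Denoiser()(z_t, t, content, speaker), noise)
loss.backward()
```

At inference time, the same denoiser would be applied iteratively from pure noise, with the content and speaker conditions held fixed, to reconstruct intelligible speech in the original speaker's voice.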