DiffDSR: Dysarthric Speech Reconstruction Using Latent Diffusion Model

📅 2025-05-31
🤖 AI Summary
To address the low intelligibility and speaker identity loss inherent in dysarthric speech, this paper proposes the first speech reconstruction framework based on Latent Diffusion Models (LDMs). Methodologically, the authors adapt LDMs to jointly model phoneme-level content restoration and speaker identity preservation: robust phoneme embeddings are extracted using self-supervised learning (SSL) models (e.g., wav2vec 2.0), then integrated via dedicated phoneme and speaker identity encoders; an in-context learning mechanism further enables identity-conditioned generation. Experiments on the UASpeech dataset demonstrate significant improvements—18.7% reduction in Word Error Rate (WER) and 23.5% increase in speaker similarity (measured by cosine similarity)—thereby establishing a novel paradigm for dysarthric speech enhancement.
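The speaker-similarity metric mentioned above is the cosine similarity between speaker embeddings (e.g., from a speaker-verification model) of the reconstructed and reference speech. A minimal illustration of the metric itself:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: parallel vectors give similarity 1.0, i.e. a "same speaker" score.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0
```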

📝 Abstract
Dysarthric speech reconstruction (DSR) aims to convert dysarthric speech into comprehensible speech while maintaining the speaker's identity. Despite significant advancements, existing methods often struggle with low speech intelligibility and poor speaker similarity. In this study, we introduce a novel diffusion-based DSR system that leverages a latent diffusion model to enhance the quality of speech reconstruction. Our model comprises: (i) a speech content encoder for phoneme embedding restoration via pre-trained self-supervised learning (SSL) speech foundation models; (ii) a speaker identity encoder for speaker-aware identity preservation via an in-context learning mechanism; (iii) a diffusion-based speech generator that reconstructs the speech from the restored phoneme embedding and preserved speaker identity. Through evaluations on the widely-used UASpeech corpus, our proposed model shows notable enhancements in speech intelligibility and speaker similarity.
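The three-module pipeline in the abstract can be sketched end to end. This is a toy numpy sketch, not the paper's implementation: all dimensions, the linear "encoders", and the fixed-step denoising loop are illustrative assumptions standing in for the SSL content encoder, the in-context speaker encoder, and the latent diffusion generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
D_SSL, D_PHONE, D_SPK, D_LATENT = 768, 128, 64, 80

def content_encoder(ssl_feats):
    """Map SSL features (e.g., wav2vec 2.0 frames) to restored phoneme embeddings."""
    W = rng.standard_normal((D_SSL, D_PHONE)) * 0.01
    return ssl_feats @ W

def speaker_encoder(ref_feats):
    """Pool a reference utterance into one speaker-identity vector (in-context style)."""
    W = rng.standard_normal((D_SSL, D_SPK)) * 0.01
    return (ref_feats @ W).mean(axis=0)

def diffusion_generator(phone_emb, spk_emb, steps=10):
    """Toy reverse-diffusion loop: start from noise, iteratively denoise a latent
    conditioned on content + speaker embeddings. Stand-in for the paper's LDM."""
    T = phone_emb.shape[0]
    cond = np.concatenate([phone_emb, np.tile(spk_emb, (T, 1))], axis=1)
    W = rng.standard_normal((cond.shape[1], D_LATENT)) * 0.01
    x = rng.standard_normal((T, D_LATENT))   # x_T ~ N(0, I)
    target = cond @ W                        # toy "clean" latent predicted from cond
    for _ in range(steps):
        x = x + 0.3 * (target - x)           # step toward the conditioned estimate
    return x

# A dysarthric utterance (50 SSL frames) and a reference utterance (30 frames).
ssl_feats = rng.standard_normal((50, D_SSL))
ref_feats = rng.standard_normal((30, D_SSL))

phone_emb = content_encoder(ssl_feats)
spk_emb = speaker_encoder(ref_feats)
latent = diffusion_generator(phone_emb, spk_emb)
print(latent.shape)  # (50, 80): latent frames to be decoded/vocoded into speech
```

In the real system each of these modules is a trained network, and the generator runs a learned noise-prediction model over many diffusion steps rather than a fixed interpolation.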
Problem

Research questions and friction points this paper is trying to address.

Convert dysarthric speech to comprehensible speech
Improve speech intelligibility and speaker similarity
Use latent diffusion model for speech reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses latent diffusion model for speech reconstruction
Employs SSL speech foundation models for phoneme embedding restoration
Integrates in-context learning for speaker identity
Xueyuan Chen
The Chinese University of Hong Kong
Speech Synthesis, Speech Reconstruction, Voice Conversion, Audio-Visual Processing
Dongchao Yang
Chinese University of Hong Kong
TTS, TTA, Audio Codec, Multi-modal Audio Foundation Models
Wenxuan Wu
Oregon State University; CASIA
Computer Vision, Point Cloud Processing
Minglin Wu
The Chinese University of Hong Kong, Hong Kong SAR, China
Jing Xu
The Chinese University of Hong Kong, Hong Kong SAR, China
Xixin Wu
The Chinese University of Hong Kong
Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Helen Meng
The Chinese University of Hong Kong, Hong Kong SAR, China; Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Vocal Engineering Technologies Limited, Hong Kong SAR, China