Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis

📅 2026-02-09
🤖 AI Summary
This work addresses the challenge of disentangling speaker identity from pathological speech characteristics in dysarthric speech synthesis, a task made difficult by high variability and scarce labeled data. To improve controllability and robustness, the authors propose ProtoDisent-TTS, a prototype-based disentanglement framework built on a pre-trained text-to-speech (TTS) backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable, controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. On the TORGO dataset, the method enables bidirectional transformation between healthy and dysarthric speech, yielding consistent automatic speech recognition (ASR) gains and robust, speaker-aware speech reconstruction.

📝 Abstract
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
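
The gradient reversal layer named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `GradReverse`, the scale `lam`, and the hand-written backward pass are assumptions made for illustration, written in plain Python so the example runs without a deep learning framework. The layer acts as the identity in the forward pass and negates (and scales) the gradient in the backward pass, so a pathology classifier attached to the speaker embedding would push the encoder to discard pathology cues rather than keep them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradReverse:
    """Identity in the forward pass; flips and scales gradients in the backward pass."""

    lam: float = 1.0  # hypothetical reversal strength, not taken from the paper

    def forward(self, x: List[float]) -> List[float]:
        # The speaker embedding passes through unchanged.
        return list(x)

    def backward(self, grad_out: List[float]) -> List[float]:
        # The gradient coming from the adversarial classifier is negated
        # (and scaled by lam) before it reaches the encoder.
        return [-self.lam * g for g in grad_out]


grl = GradReverse(lam=0.5)
speaker_emb = [0.2, -1.0, 3.0]            # made-up embedding values
out = grl.forward(speaker_emb)            # same values as the input
grad_to_encoder = grl.backward([1.0, -2.0, 0.0])
```

In a framework with automatic differentiation, the same behaviour would normally be expressed as a custom gradient rule rather than an explicit `backward` call.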

Problem

Research questions and friction points this paper is trying to address.

dysarthric speech synthesis
disentanglement
speaker identity
pathological articulation
controllability

Innovation

Methods, ideas, or system contributions that make the work stand out.

prototype-based disentanglement
dysarthric speech synthesis
pathology prototype codebook
gradient reversal layer
speaker-pathology factorization
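
To make the pathology prototype codebook idea concrete, here is a small sketch under stated assumptions. The three-dimensional prototype values, the function names `blend`, `nearest_prototype`, and `condition`, and the choice to combine speaker and pathology vectors by addition are all hypothetical; the source only states that a codebook provides controllable representations of healthy and dysarthric speech patterns.

```python
import math
from typing import Dict, List

# Hypothetical prototypes; in a real system these would be learned vectors.
CODEBOOK: Dict[str, List[float]] = {
    "healthy": [1.0, 0.0, 0.0],
    "dysarthric": [0.0, 1.0, 0.0],
}


def blend(alpha: float) -> List[float]:
    """Interpolate from the healthy (alpha=0) to the dysarthric (alpha=1) prototype."""
    h, d = CODEBOOK["healthy"], CODEBOOK["dysarthric"]
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(h, d)]


def nearest_prototype(vec: List[float]) -> str:
    """Return the label of the prototype closest to vec in Euclidean distance."""
    return min(CODEBOOK, key=lambda name: math.dist(vec, CODEBOOK[name]))


def condition(speaker_emb: List[float], pathology_vec: List[float]) -> List[float]:
    # Assumption: speaker and pathology factors are combined by addition.
    return [s + p for s, p in zip(speaker_emb, pathology_vec)]


speaker_emb = [0.5, 0.5, 0.5]                      # made-up speaker embedding
dysarthric_latent = condition(speaker_emb, blend(1.0))
healthy_latent = condition(speaker_emb, blend(0.0))
```

Keeping the speaker embedding fixed while swapping or interpolating the prototype is one way to picture the bidirectional healthy/dysarthric control described in the abstract.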
Haoshen Wang
Department of Language Science and Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China
Xueli Zhong
College of Rehabilitation Medicine, Fujian University of Traditional Chinese Medicine, Fuzhou 350122, China
Bingbing Lin
College of Rehabilitation Medicine, Fujian University of Traditional Chinese Medicine, Fuzhou 350122, China
Jia Huang
College of Rehabilitation Medicine, Fujian University of Traditional Chinese Medicine, Fuzhou 350122, China
Xingduo Pan
College of Rehabilitation Medicine, Fujian University of Traditional Chinese Medicine, Fuzhou 350122, China; Department of Imaging, Rehabilitation Hospital affiliated to Fujian University of Traditional Chinese Medicine, Fuzhou, Fujian 350003, China
Shengxiang Liang
Fujian University of Traditional Chinese Medicine
Nizhuan Wang
The Hong Kong Polytechnic University (PolyU)
AI, Brain-Computer Interface, Neuroimaging, Computational Linguistics, Neurolinguistics
Wai Ting Siok
The Hong Kong Polytechnic University
Reading development, Chinese reading, Developmental dyslexia, Neuroimaging, fMRI