🤖 AI Summary
To address weak phoneme discriminability and severe acoustic mismatch in unsupervised discrete-token learning for dysarthric speech recognition, this paper proposes a phone-purity-guided discretization framework that integrates phonetic label supervision into both K-means and VAE-VQ token extraction. Operating on HuBERT features, the method regularizes the standard maximum-likelihood and reconstruction-error objectives with a phone-purity constraint and plugs into hybrid TDNN and End-to-End Conformer systems. On the UASpeech test set of 16 dysarthric speakers, it yields statistically significant absolute WER reductions of up to 0.99% (TDNN) and 1.77% (Conformer) over non-guided tokens across codebook sizes, with a lowest WER of 23.25% obtained by combining systems using different token features. t-SNE visualizations show sharper token cluster boundaries, and the phone purity metric consistently improves. The work strengthens unsupervised token modeling of atypical articulation and offers an interpretable, highly discriminative discrete representation paradigm for low-resource pathological speech recognition.
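The phone-purity-guided K-means idea can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's exact objective: the regularizer weight `lam`, the Laplace smoothing, and the deterministic initialization are all our own choices. The assignment step trades frame-to-centroid distance against a smoothed `-log P(phoneme | cluster)` term estimated from the previous assignment, which biases frames sharing a phoneme label toward the same token.

```python
import numpy as np

def ppg_kmeans(X, phones, k, lam=1.0, iters=10):
    """Phone-purity-guided K-means (illustrative sketch, not the paper's
    exact formulation). Each frame is assigned to the cluster minimizing
    squared distance minus lam * log P(phone | cluster)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    _, phones = np.unique(phones, return_inverse=True)   # phoneme labels -> int ids
    P = phones.max() + 1
    C = X[np.linspace(0, n - 1, k).astype(int)].copy()   # simple deterministic init
    dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    assign = dist.argmin(1)                              # plain K-means init assignment
    for _ in range(iters):
        # estimate P(phone | cluster) from the current assignment (Laplace-smoothed)
        counts = np.zeros((k, P))
        np.add.at(counts, (assign, phones), 1.0)
        logp = np.log((counts + 1.0) / (counts.sum(1, keepdims=True) + P))
        # assignment step: Euclidean distance minus lam * log-purity term
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = (dist - lam * logp[:, phones].T).argmin(1)
        # update step: recompute centroids of non-empty clusters
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(0)
    return assign, C
```

With `lam=0` this reduces to standard K-means; increasing `lam` trades quantization error for phoneme-consistent cluster assignments.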
📝 Abstract
Discrete tokens extracted from speech foundation models such as HuBERT provide efficient and domain-adaptable speech features, but their application to disordered speech, which exhibits imprecise articulation and a large mismatch against normal voice, remains unexplored. To improve the phonetic discrimination that is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize the maximum-likelihood and reconstruction-error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that the proposed PPG discrete token features extracted from HuBERT consistently outperform hybrid TDNN and End-to-End (E2E) Conformer systems using non-PPG K-means or VAE-VQ tokens across varying codebook sizes, with statistically significant word error rate (WER) reductions of up to 0.99% and 1.77% absolute (3.21% and 4.82% relative) respectively on the UASpeech test set of 16 dysarthric speakers. The lowest WER of 23.25% was obtained by combining systems using different token features. Consistent improvements on the phone purity metric were also achieved, and t-SNE visualization further demonstrates that sharper decision boundaries are produced between K-means/VAE-VQ clusters after introducing phone-purity guidance.
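The phone purity metric reported above is commonly computed as the fraction of frames covered by each discrete token's majority phoneme label; a minimal sketch (the function name and toy data are our own, not from the paper):

```python
import numpy as np

def phone_purity(tokens, phones):
    """Frame-level phone purity: for each discrete token (cluster), take
    the most frequent phoneme among its frames; purity is the fraction
    of all frames accounted for by these majority phonemes."""
    tokens = np.asarray(tokens)
    phones = np.asarray(phones)
    correct = 0
    for t in np.unique(tokens):
        labels = phones[tokens == t]
        # count of the majority phoneme within this token's frames
        _, counts = np.unique(labels, return_counts=True)
        correct += counts.max()
    return correct / len(tokens)

# toy example: tokens 0 and 1 mostly align with phonemes "a" and "b"
toks = [0, 0, 0, 1, 1, 1]
phs = ["a", "a", "b", "b", "b", "b"]
print(phone_purity(toks, phs))  # 5 of 6 frames covered -> 5/6
```

A purity of 1.0 means every token maps to a single phoneme; the PPG regularization described above pushes the clustering toward this regime.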