Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0

📅 2026-04-23
🤖 AI Summary
This study examines which components of wav2vec 2.0 representations are most informative for specific dysarthric speech descriptors, a question left open despite the model's strong performance in pathological speech analysis. It systematically compares layer-wise and time-wise aggregation strategies, both implemented with attentive statistics pooling, for regressing five descriptors annotated in the Speech Accessibility Project dataset: intelligibility, imprecise consonants, inappropriate silences, harsh voice, and monoloudness. Intelligibility is best captured by layer-wise aggregation, whereas imprecise consonants, harsh voice, and monoloudness benefit from time-wise modeling. For inappropriate silences, neither strategy shows a clear advantage. The findings indicate that different dimensions of dysarthric speech draw on different structural aspects of self-supervised representations, which is relevant when adapting such models for clinical applications.

📝 Abstract
Wav2vec 2.0 (W2V2) has shown strong performance in pathological speech analysis by effectively capturing the characteristics of atypical speech. Despite its success, it remains unclear which components of its learned representations are most informative for specific downstream tasks. In this study, we address this question by investigating the regression of dysarthric speech descriptors using annotations from the Speech Accessibility Project dataset. We focus on five descriptors, each addressing a different aspect of speech or voice production: intelligibility, imprecise consonants, inappropriate silences, harsh voice and monoloudness. Speech representations are derived from a W2V2-based feature extractor, and we systematically compare layer-wise and time-wise aggregation strategies using attentive statistics pooling. Our results show that intelligibility is best captured through layer-wise representations, whereas imprecise consonants, harsh voice and monoloudness benefit from time-wise modeling. For inappropriate silences, no clear advantage could be observed for either approach.
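
To make the two aggregation strategies concrete, here is a minimal NumPy sketch of attentive statistics pooling applied either across layers or across time frames. It is an illustration under stated assumptions, not the paper's implementation: the tensor sizes, the random parameters `W`, `b`, `v`, and the choice to average over the non-pooled axis before pooling are all hypothetical, and in practice the parameters would be learned jointly with a regression head.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attentive_stats_pool(x, W, b, v):
    """Pool x of shape (N, D) over its first axis.

    Returns the attention-weighted mean and standard deviation,
    concatenated into a vector of shape (2 * D,).
    """
    scores = np.tanh(x @ W + b) @ v           # one scalar score per row, shape (N,)
    alpha = softmax(scores)                    # attention weights, sum to 1
    mean = alpha @ x                           # weighted mean, shape (D,)
    var = alpha @ (x ** 2) - mean ** 2         # weighted variance, shape (D,)
    std = np.sqrt(np.clip(var, 1e-8, None))    # clip guards against tiny negatives
    return np.concatenate([mean, std])

# Hypothetical sizes: L layers, T frames, D feature dims, A attention dims.
rng = np.random.default_rng(0)
L, T, D, A = 13, 50, 8, 4
hidden = rng.normal(size=(L, T, D))            # stand-in for stacked W2V2 hidden states
W, b, v = rng.normal(size=(D, A)), np.zeros(A), rng.normal(size=A)

# Layer-wise (assumption): average over time, then attend over the L layers.
layer_wise = attentive_stats_pool(hidden.mean(axis=1), W, b, v)

# Time-wise (assumption): average over layers, then attend over the T frames.
time_wise = attentive_stats_pool(hidden.mean(axis=0), W, b, v)

print(layer_wise.shape, time_wise.shape)
```

Both calls yield a fixed-size vector of length `2 * D` regardless of utterance length, which is what lets a single regressor consume either view. The difference lies only in which axis the attention weights range over: layers in the first case, frames in the second.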
Problem

Research questions and friction points this paper is trying to address.

dysarthric speech
wav2vec 2.0
speech descriptors
representation analysis
pathological speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

wav2vec 2.0
dysarthric speech descriptors
layer-wise aggregation
time-wise aggregation
attentive statistics pooling