Probing Whisper for Dysarthric Speech in Detection and Assessment

📅 2025-10-05
🤖 AI Summary
This study investigates how the Whisper-Medium encoder represents dysarthric speech for two tasks: dysarthria detection and severity assessment. To address the limited interpretability of pre-trained models in pathological speech analysis, it applies a layer-wise probing framework: linear classifiers are trained on the embeddings of each encoder layer under both single- and multi-task settings, and layer informativeness is further quantified with the silhouette coefficient and mutual information. The analysis is then repeated after fine-tuning Whisper on a dysarthric speech recognition task. Encoder layers 13–15 show the highest discriminative capacity for dysarthric features, and fine-tuning changes the results only modestly, improving detection F1 by at most 1.2% and severity-assessment quadratic weighted kappa by at most 0.04. These findings suggest that the mid-level representations of large-scale pre-trained speech models are already well suited to pathological speech analysis, and that probing can guide interpretable, AI-assisted assessment with little or no fine-tuning rather than full model retraining.

📝 Abstract
Large-scale end-to-end models such as Whisper have shown strong performance on diverse speech tasks, but their internal behavior on pathological speech remains poorly understood. Understanding how dysarthric speech is represented across layers is critical for building reliable and explainable clinical assessment tools. This study probes the encoder of the Whisper-Medium model for dysarthric speech detection and assessment (i.e., severity classification). We evaluate layer-wise embeddings with a linear classifier under both single-task and multi-task settings, and complement these results with Silhouette scores and mutual information to provide additional perspectives on layer informativeness. To examine adaptability, we repeat the analysis after fine-tuning Whisper on a dysarthric speech recognition task. Across metrics, the mid-level encoder layers (13–15) emerge as the most informative, while fine-tuning induces only modest changes. The findings improve the interpretability of Whisper's embeddings and highlight the potential of probing analyses to guide the use of large-scale pretrained models for pathological speech.
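The layer-wise linear probing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each utterance has already been reduced to one fixed-size vector per encoder layer (for example by mean pooling, which is an assumption), and it replaces real Whisper embeddings with synthetic data whose class separation is chosen by hand. The helper names (`make_layer_embeddings`, `train_logistic_probe`, `probe_accuracy`, `split`) and the layer numbers in `layer_separation` are hypothetical, and the probe is a plain logistic regression written with the standard library only.

```python
import math
import random


def train_logistic_probe(X, y, epochs=300, lr=0.5):
    """Fit a binary logistic-regression probe with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of samples on the correct side of the linear boundary."""
    correct = 0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        correct += int((z > 0) == (yi == 1))
    return correct / len(X)


def make_layer_embeddings(separation, n_per_class=100, dim=4, seed=0):
    """Synthetic stand-in for pooled embeddings of one layer (assumption).

    Label 0 plays the role of typical speech, label 1 of dysarthric speech.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * separation, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


def split(X, y, train_fraction=0.8, seed=0):
    """Shuffle once and split into train and held-out sets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(idx))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in te], [y[i] for i in te])


# Toy layers: the separation is made to peak in the middle on purpose,
# purely to mimic the qualitative finding; the values are invented.
layer_separation = {1: 0.1, 12: 0.5, 14: 2.0, 24: 0.3}

layer_scores = {}
for layer, sep in layer_separation.items():
    X, y = make_layer_embeddings(sep, seed=layer)
    X_tr, y_tr, X_te, y_te = split(X, y, seed=layer)
    w, b = train_logistic_probe(X_tr, y_tr)
    layer_scores[layer] = probe_accuracy(w, b, X_te, y_te)

best_layer = max(layer_scores, key=layer_scores.get)
print(layer_scores, "most informative layer:", best_layer)
```

The probe is kept linear on purpose: a low-capacity classifier can only succeed if the information is already linearly accessible in the frozen embeddings, which is what makes its held-out accuracy a reasonable proxy for layer informativeness. With real data, the same loop would run once per encoder layer and once per task (detection, severity).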
Problem

Research questions and friction points this paper is trying to address.

Probing Whisper model for dysarthric speech detection
Analyzing layer-wise embeddings for severity classification
Investigating model adaptability through fine-tuning experiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probes Whisper encoder layers for dysarthric speech
Evaluates embeddings with linear classifiers and metrics
Identifies mid-level layers as most informative
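
The two complementary metrics named in the abstract, the silhouette score and mutual information, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the silhouette is computed with class labels used as cluster assignments and Euclidean distance, and the mutual information shown here is the plug-in estimate for discrete variables, which assumes the embedding feature has already been discretized. The source does not say which estimator the authors used; for continuous embeddings a library estimator such as scikit-learn's `mutual_info_classif` would be a more typical choice. The toy `points` and label lists are invented for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two distinct labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same class
        # (the distance of p to itself is 0, hence the len(own) - 1)
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other class
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Toy 2-D "embeddings": two tight groups far apart.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
separated = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
mixed = silhouette_score(points, [0, 1, 0, 1])      # labels cut across groups

informative = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
uninformative = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
print(separated, mixed, informative, uninformative)
```

In this toy case the silhouette is roughly 0.9 when the labels follow the two groups and negative when they cut across them, while the mutual information is 1 bit for a feature that determines the label and 0 for an independent one. Applied per encoder layer, higher values of either metric would indicate that the layer separates the classes more cleanly.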