🤖 AI Summary
This work addresses the shortage and excessive caseloads of speech-language pathologists (SLPs) who diagnose childhood speech sound disorders (SSD) by proposing a cascaded multi-task classification framework built on fine-tuned speech representation models (SRMs). The approach progressively refines predictions, from binary classification to SSD type and then symptom classification, on the SLPHelmUltraSuitePlus benchmark. Fine-tuning combined with targeted data augmentation mitigates biases reported in previous work, and the same augmentation is also applied to automatic speech recognition (ASR). Experimental results show that the SRMs outperform the LLM-based state of the art by a large margin on all evaluated tasks. The authors release their models and code to support further research.
📝 Abstract
Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach that moves from binary classification to type classification and then symptom classification. By fine-tuning Speech Representation Models (SRMs) and using targeted data augmentation, we mitigate biases found by previous work and improve results on all clinical tasks in the benchmark. We also apply our data augmentation approach to Automatic Speech Recognition (ASR). Our results demonstrate that SRMs consistently outperform the LLM-based state of the art across all evaluated tasks by a large margin. We publish our models and code to foster future research.
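The cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code. It assumes that the type and symptom stages run only when the binary stage flags a disorder, which the abstract does not state explicitly. The function `cascade_predict`, the `Classifier` alias, and the label strings are hypothetical placeholders; in practice each stage would presumably be a fine-tuned SRM.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical stage classifier: maps one utterance's features to a label string.
Classifier = Callable[[Sequence[float]], str]


def cascade_predict(
    features: Sequence[float],
    binary_stage: Classifier,
    type_stage: Classifier,
    symptom_stage: Classifier,
    positive_label: str = "disordered",  # assumed label name, not from the paper
) -> Dict[str, Optional[str]]:
    """Run the three stages in order, stopping early for typical speech."""
    result: Dict[str, Optional[str]] = {
        "binary": binary_stage(features),
        "type": None,
        "symptom": None,
    }
    # Assumption: finer-grained stages are skipped when no disorder is detected.
    if result["binary"] != positive_label:
        return result
    result["type"] = type_stage(features)
    result["symptom"] = symptom_stage(features)
    return result
```

Passing the stages in as callables keeps the control flow independent of any particular model, so stub classifiers can stand in for trained ones when checking the logic.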