🤖 AI Summary
This study addresses the poor automatic speech recognition (ASR) performance of mainstream speech foundation models (Whisper, Wav2Vec2, HuBERT, and WavLM) on child–adult conversational speech recorded in autism diagnostic sessions. Systematic evaluation on this real-world clinical data shows that word error rate (WER) is 15–20% absolute higher for child speech than for adult speech, exposing the models' difficulty with child speech in conversational settings. To probe how far fine-tuning helps in a low-resource setting, we apply Low-Rank Adaptation (LoRA) to the best-performing zero-shot model, Whisper-large. LoRA fine-tuning yields an absolute WER improvement of ~8% on child speech and ~13% on adult speech. This work provides an empirical benchmark for adapting speech foundation models to clinical child–adult conversations.
📝 Abstract
The ability to reliably transcribe child-adult conversations in a clinical setting is valuable for the diagnosis and understanding of numerous developmental disorders such as Autism Spectrum Disorder. Recent advances in deep learning architectures and the availability of large-scale transcribed data have led to the development of speech foundation models that show dramatic improvements in ASR performance. However, the ability of these models to translate well to conversational child-adult interactions is understudied. In this work, we provide a comprehensive evaluation of ASR performance on a dataset containing child-adult interactions from autism diagnostic sessions, using Whisper, Wav2Vec2, HuBERT, and WavLM. We find that speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting. We then apply LoRA to the best-performing zero-shot model (whisper-large) to probe the effectiveness of fine-tuning in a low-resource setting, resulting in ~8% absolute WER improvement for child speech and ~13% absolute WER improvement for adult speech.
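The results above are reported as absolute WER differences, that is, differences in percentage points rather than relative reductions. As a minimal sketch of how such a figure can be computed, the Python below derives WER from a word-level edit distance and then takes the gap between two WER values. The function names `word_error_rate` and `absolute_wer_gap` and the toy sentences are hypothetical illustrations, not taken from the paper, and the sketch assumes whitespace tokenization with no text normalization, which may differ from the authors' scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_gap(wer_a: float, wer_b: float) -> float:
    """Absolute WER difference in percentage points (wer_a minus wer_b)."""
    return (wer_a - wer_b) * 100.0


# Toy utterances (hypothetical, not from the paper's dataset).
child_wer = word_error_rate("a b c d", "a x c")            # one substitution, one deletion
adult_wer = word_error_rate("the cat sat", "the cat sat")  # exact match
gap = absolute_wer_gap(child_wer, adult_wer)
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance scores; this sketch scores a single utterance pair only.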