Evaluation of state-of-the-art ASR Models in Child-Adult Interactions

📅 2024-09-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates mainstream speech foundation models (Whisper, Wav2Vec2, HuBERT, and WavLM) on child-adult conversational speech recorded during autism diagnostic sessions. Zero-shot evaluation shows that word error rate (WER) is 15-20% absolute higher for child speech than for adult speech in the same conversations. The authors then apply Low-Rank Adaptation (LoRA) to the best-performing zero-shot model, Whisper-large, to probe how effective fine-tuning is in a low-resource setting. Fine-tuning yields an absolute WER improvement of about 8% on child speech and about 13% on adult speech.

📝 Abstract
The ability to reliably transcribe child-adult conversations in a clinical setting is valuable for the diagnosis and understanding of numerous developmental disorders such as Autism Spectrum Disorder. Recent advances in deep learning architectures and the availability of large-scale transcribed data have led to the development of speech foundation models that have shown dramatic improvements in ASR performance. However, the ability of these models to translate well to conversational child-adult interactions is understudied. In this work, we provide a comprehensive evaluation of ASR performance on a dataset containing child-adult interactions from autism diagnostic sessions, using Whisper, Wav2Vec2, HuBERT, and WavLM. We find that speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting. Then, we employ LoRA on the best-performing zero-shot model (whisper-large) to probe the effectiveness of fine-tuning in a low-resource setting, resulting in ~8% absolute WER improvement for child speech and ~13% absolute WER improvement for adult speech.
Problem

Research questions and friction points this paper is trying to address.

Evaluate ASR models on child-adult autism diagnostic conversations
Assess performance gap between child and adult speech transcription
Improve child speech recognition via low-resource fine-tuning
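
The child-adult gap above is reported as an absolute WER difference, i.e. a difference in percentage points rather than a relative change. The following is a minimal sketch of how such a gap could be computed. The function names and the toy transcripts are illustrative assumptions, not the paper's evaluation code or data, and real evaluations typically normalize text (case, punctuation) before scoring.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total edits / total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return total_edits / total_words


# Toy, invented utterances grouped by speaker role (not from the paper's dataset).
child_pairs = [("i want the red ball", "i want red ball")]
adult_pairs = [("can you show me the ball", "can you show me the ball")]

# Absolute gap in percentage points: WER(child) - WER(adult).
gap_points = 100 * (corpus_wer(child_pairs) - corpus_wer(adult_pairs))
print(f"absolute WER gap: {gap_points:.1f} points")
```

Scoring child and adult utterances separately, as in this sketch, is what makes a per-speaker gap visible; a single conversation-level WER would average it away.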
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated speech models on child-adult conversations
Fine-tuned Whisper-large using LoRA
Improved WER for child and adult speech
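
To illustrate why LoRA suits a low-resource setting, the sketch below shows the general idea on a single linear layer: the pretrained weight stays frozen and only a low-rank update is trained. This is a toy, pure-Python illustration of the LoRA technique in general, not the authors' training setup; the class name, the rank, the scaling value, and the matrix sizes are all assumptions chosen for readability.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight       # pretrained weight, kept frozen
        self.scale = alpha / rank  # scaling applied to the low-rank update
        # A starts as small random values and B as zeros, so the adapted
        # layer initially behaves exactly like the pretrained layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self) -> int:
        """Only A and B would be updated during fine-tuning."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Hypothetical 4x3 pretrained weight, adapted with rank 2.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
layer = LoRALinear(W, rank=2)
print(layer.forward([1.0, 2.0, 3.0]), layer.trainable_params())
```

For a weight of shape d_out x d_in, the update adds r * (d_in + d_out) trainable parameters instead of d_out * d_in. The saving is negligible at this toy size but large for the wide layers of a model such as Whisper-large, which is what makes fine-tuning on limited child speech data tractable.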
Aditya Ashvin
Signal Analysis and Interpretation Laboratory, University of Southern California, USA
Rimita Lahiri
Signal Analysis and Interpretation Laboratory, University of Southern California, USA
Aditya Kommineni
University of Southern California
Somer Bishop
Department of Psychiatry, University of California, San Francisco, California, USA
Catherine Lord
Semel Institute of Neuroscience and Human Behavior, University of California, Los Angeles, USA
Sudarsana Reddy Kadiri
University of Southern California
Speech Processing, Biomedical Signals, Multimodality, Healthcare Informatics, Deep Learning
Shrikanth S. Narayanan
Signal Analysis and Interpretation Laboratory, University of Southern California, USA