Evaluation of state-of-the-art ASR Models in Child-Adult Interactions

📅 2024-09-24

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates how well mainstream speech foundation models (Whisper, Wav2Vec2, HuBERT, and WavLM) transcribe child-adult conversational speech recorded in autism diagnostic sessions. In zero-shot evaluation, word error rate (WER) on child speech is 15-20% (absolute) higher than on adult speech in the same conversational setting. The authors then fine-tune the best zero-shot model, Whisper-large, with Low-Rank Adaptation (LoRA) to probe how effective fine-tuning is in a low-resource setting. LoRA improves absolute WER by ~8% on child speech and ~13% on adult speech. Because adult speech improves more than child speech, fine-tuning alone does not close the child-adult gap. The work provides an empirical benchmark for ASR on child-adult clinical interactions.

📝 Abstract

The ability to reliably transcribe child-adult conversations in a clinical setting is valuable for the diagnosis and understanding of numerous developmental disorders such as Autism Spectrum Disorder. Recent advances in deep learning architectures and the availability of large-scale transcribed data have led to the development of speech foundation models that have shown dramatic improvements in ASR performance. However, the ability of these models to translate well to conversational child-adult interactions is understudied. In this work, we provide a comprehensive evaluation of ASR performance on a dataset containing child-adult interactions from autism diagnostic sessions, using Whisper, Wav2Vec2, HuBERT, and WavLM. We find that speech foundation models show a noticeable performance drop (15-20% absolute WER) for child speech compared to adult speech in the conversational setting. We then apply LoRA to the best-performing zero-shot model (whisper-large) to probe the effectiveness of fine-tuning in a low-resource setting, resulting in ~8% absolute WER improvement for child speech and ~13% absolute WER improvement for adult speech.

Problem

Research questions and friction points this paper is trying to address.

Evaluate ASR models on child-adult autism diagnostic conversations

Assess performance gap between child and adult speech transcription

Improve child speech recognition via low-resource fine-tuning
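
The child-adult gap above is reported in absolute WER, i.e. the difference between two error rates rather than a ratio. A minimal sketch of how such a per-speaker comparison could be computed is shown below. The transcripts are invented for illustration and are not from the paper's dataset, and the helper names `edit_distance` and `wer` are hypothetical; the paper's own scoring pipeline and text normalization are not described in this summary.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(pairs):
    """Corpus-level WER: total word edits divided by total reference words."""
    edits = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return edits / words


# Invented (reference, hypothesis) pairs, grouped by speaker role.
adult = [
    ("can you show me the red car", "can you show me the red car"),
    ("what is the frog doing", "what is a frog doing"),
]
child = [
    ("the frog is jumping", "the fog is jumping"),
    ("i want the red car", "i want red car"),
]

adult_wer = wer(adult)
child_wer = wer(child)
# Absolute gap in percentage points: child WER minus adult WER.
absolute_gap = 100 * (child_wer - adult_wer)
print(f"adult WER {adult_wer:.1%}, child WER {child_wer:.1%}, gap {absolute_gap:.1f} points")
```

Scoring each speaker role separately, as here, is what makes a child-versus-adult comparison possible within the same conversation.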

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated speech models on child-adult conversations

Fine-tuned Whisper-large using LoRA

Improved absolute WER by ~8% for child speech and ~13% for adult speech
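
To illustrate why LoRA suits a low-resource setting, the sketch below shows the general low-rank update on a single linear layer in plain Python: the pretrained weight `W` stays frozen and only two small matrices `A` and `B` are trained. This is a generic illustration of the technique, not the authors' implementation. The layer width, rank, and `alpha` values are assumptions chosen for the example, and the function names are hypothetical.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W is d_out x d_in and frozen; A is r x d_in and B is d_out x r,
    and only A and B would receive gradient updates.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Return (full fine-tuning, LoRA) trainable parameter counts for one layer."""
    return d_out * d_in, r * (d_in + d_out)


# Tiny example: 2x3 frozen weight, rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]   # r x d_in
B = [[0.0], [0.0]]      # d_out x r, zero-initialised so training starts from W
x = [1.0, 2.0, 3.0]

# With B at zero the adapted layer reproduces the frozen layer exactly.
assert lora_forward(W, A, B, x, alpha=2.0) == matvec(W, x)

# Illustrative sizes only (assumed, not taken from the paper).
full, lora = trainable_params(d_out=1280, d_in=1280, r=8)
print(f"full: {full} params, LoRA: {lora} params ({lora / full:.2%})")
```

Training only `A` and `B` keeps the number of updated parameters small per adapted layer, which is the usual argument for LoRA when annotated data is limited.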

🔎 Similar Papers

Personalized Speech Recognition for Children with Test-Time Adaptation