Two-Stage Adaptation for Non-Normative Speech Recognition: Revisiting Speaker-Independent Initialization for Personalization

📅 2026-03-16
🤖 AI Summary
This work addresses the limited effectiveness of direct speaker-specific fine-tuning (SS-FT) of general-purpose pre-trained automatic speech recognition (ASR) models on highly variable disordered speech, such as that associated with dysarthria or aphasia. To overcome this challenge, the authors propose a two-stage adaptation framework: first performing speaker-independent fine-tuning (SI-FT) on multi-speaker disordered speech data, followed by speaker-specific fine-tuning. This study provides the first systematic validation of SI-FT as an effective initialization strategy for personalization. Evaluated on disordered speech benchmarks including AphasiaBank and UA-Speech, the approach significantly improves recognition accuracy, while incurring only controlled performance degradation on out-of-domain canonical speech datasets such as TED-LIUM v3 and FLEURS. Experiments using Whisper-Large-v3 and Qwen3-ASR consistently demonstrate the superiority of the two-stage strategy over direct SS-FT.

📝 Abstract
Personalizing automatic speech recognition (ASR) systems for non-normative speech, such as dysarthric and aphasic speech, is challenging. While speaker-specific fine-tuning (SS-FT) is widely used, it is typically initialized directly from a generic pre-trained model. Whether speaker-independent adaptation provides a stronger initialization prior under such mismatch remains unclear. In this work, we propose a two-stage adaptation framework consisting of speaker-independent fine-tuning (SI-FT) on multi-speaker non-normative data followed by SS-FT, and evaluate it through a controlled comparison with direct SS-FT under identical per-speaker conditions. Experiments on AphasiaBank and UA-Speech with Whisper-Large-v3 and Qwen3-ASR, alongside evaluation on typical-speech datasets TED-LIUM v3 and FLEURS, show that two-stage adaptation consistently improves personalization while maintaining manageable out-of-domain (OOD) trade-offs.
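The two-stage idea can be illustrated with a deliberately tiny numerical sketch: a scalar model stands in for the ASR network, pooled multi-speaker data stands in for the SI-FT corpus, and a single sample stands in for scarce per-speaker adaptation data. All names and data here are hypothetical; the paper's actual experiments fine-tune Whisper-Large-v3 and Qwen3-ASR.

```python
# Toy sketch of two-stage adaptation (SI-FT then SS-FT) vs. direct SS-FT.
# A scalar model y = w * x stands in for an ASR network; data are made up.

def fine_tune(w, data, lr=0.1, steps=200):
    """Gradient descent on mean squared error for the scalar model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Generic "pre-trained" weight, tuned for canonical speech (here w = 1.0).
w_generic = 1.0

# Multi-speaker non-normative data; its targets cluster around w ~ 3.
multi_speaker = [(1.0, 2.8), (2.0, 6.2), (1.5, 4.4), (0.5, 1.6)]

# One target speaker near the multi-speaker cluster (w ~ 3.2), with very
# little adaptation data, so SS-FT gets only a few update steps.
speaker = [(1.0, 3.2)]

# Direct SS-FT: personalize straight from the generic model.
w_direct = fine_tune(w_generic, speaker, steps=5)

# Two-stage: SI-FT on pooled data first, then the same SS-FT budget.
w_si = fine_tune(w_generic, multi_speaker, steps=200)
w_two_stage = fine_tune(w_si, speaker, steps=5)

err_direct = abs(w_direct * 1.0 - 3.2)
err_two_stage = abs(w_two_stage * 1.0 - 3.2)
print(f"direct SS-FT error: {err_direct:.3f}")
print(f"two-stage error:    {err_two_stage:.3f}")
```

With the same per-speaker budget, the two-stage run starts from an SI-FT weight already close to the non-normative cluster, so its residual error after personalization is far smaller than direct SS-FT's; this mirrors the paper's claim that SI-FT provides a stronger initialization prior under domain mismatch.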
Problem

Research questions and friction points this paper is trying to address.

non-normative speech
personalization
speaker-independent adaptation
automatic speech recognition
dysarthric speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

two-stage adaptation
speaker-independent fine-tuning
non-normative speech recognition
personalized ASR
domain adaptation
Shan Jiang
Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
Jiawen Qi
Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
Chuanbing Huo
Penobscot Community Health Care, United States
Yingqiang Gao
Department of Computational Linguistics, University of Zurich, Switzerland
Qinyu Chen
Assistant Professor, Leiden University
Edge AI · IC design · Neuromorphic Computing · Event-based vision · AR/VR