🤖 AI Summary
Limited child speech data severely degrades the transferability of adult-pretrained speaker verification models, primarily because of acoustic distribution shift and the limited effectiveness of fine-tuning on so little data. To address this, we propose a cross-domain knowledge transfer framework that combines a gated linear unit (GLU) adapter with staged iterative fine-tuning, enabling efficient, lightweight adaptation of mainstream embedding backbones, including ECAPA-TDNN, ResNet, and x-vector. The GLU adapter, placed between the pretrained backbone and the classifier, models child-specific acoustic representations, while sequential, iterative optimization of the classifier, adapter, and pretrained backbone improves generalization under low-resource conditions. Experiments on the OGI and MyST child speech datasets demonstrate consistent improvements: equal error rates (EER) decrease by 18.7%–26.3% over strong baselines. The framework is both effective and architecture-agnostic, transferring across diverse speaker embedding models.
📝 Abstract
Speaker Verification (SV) systems trained on adult speech often underperform on children's speech because of acoustic mismatch, and the limited amount of children's speech data makes fine-tuning less effective. In this paper, we propose a framework, a Gated Linear Unit adapter with Iterative Fine-Tuning (G-IFT), to improve the efficiency of knowledge transfer from the high-resource adult speech domain to the low-resource children's speech domain. In this framework, a Gated Linear Unit adapter is first inserted between the pre-trained speaker embedding model and the classifier. The classifier, adapter, and pre-trained speaker embedding model are then optimized sequentially in an iterative manner. The framework is agnostic to the underlying architecture of the SV system. Our experiments with ECAPA-TDNN, ResNet, and X-vector architectures on the OGI and MyST datasets demonstrate that G-IFT yields consistent reductions in Equal Error Rate (EER) compared to baseline methods.
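The two ideas in the abstract, a gated adapter on top of the speaker embedding and a stage-by-stage training order, can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The names `glu_adapter` and `trainable_schedule`, the parameter layout, and the choice to freeze the two inactive modules during each stage are assumptions made for illustration; the abstract only states that the classifier, adapter, and embedding model are optimized sequentially and iteratively.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(x, weight, bias):
    # weight is laid out as [out_dim][in_dim]; x is a flat list of floats.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def glu_adapter(x, w_value, b_value, w_gate, b_gate):
    """Gated linear unit: (W_v x + b_v) * sigmoid(W_g x + b_g), elementwise.

    Here x stands for the embedding produced by the pre-trained speaker
    model; the output would be passed on to the classifier.
    """
    value = linear(x, w_value, b_value)
    gate = [sigmoid(g) for g in linear(x, w_gate, b_gate)]
    return [v * g for v, g in zip(value, gate)]

# Stage order taken from the abstract: classifier, adapter, embedding model.
STAGES = ("classifier", "adapter", "embedding_model")

def trainable_schedule(num_rounds):
    """List, stage by stage, which module is trainable.

    Assumption: only the active module is updated in a stage and the
    other two are frozen. The abstract does not spell this out.
    """
    schedule = []
    for _ in range(num_rounds):
        for active in STAGES:
            schedule.append({name: name == active for name in STAGES})
    return schedule

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # With a zero gate projection, every gate equals sigmoid(0) = 0.5.
    print(glu_adapter([1.0, 2.0], identity, [0.0, 0.0], zeros, [0.0, 0.0]))
    for stage in trainable_schedule(num_rounds=1):
        print(stage)
```

In a real system the two projections would be learned layers in a deep learning framework, and each schedule entry would decide which parameters receive gradient updates during that stage.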