G-IFT: A Gated Linear Unit adapter with Iterative Fine-Tuning for Low-Resource Children's Speaker Verification

📅 2025-08-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Limited child speech data severely degrades the transferability of adult-pretrained speaker verification models, primarily due to acoustic distribution shift and insufficient fine-tuning. To address this, we propose a cross-domain knowledge transfer framework that combines a gated linear unit (GLU) adapter with staged iterative fine-tuning, enabling lightweight adaptation of mainstream embedding backbones, including ECAPA-TDNN, ResNet, and x-vector. The GLU adapter explicitly models child-specific acoustic representations, while sequential, iterative optimization of the classifier, adapter, and pretrained backbone improves generalization under low-resource conditions. Experiments on the OGI and MyST child speech datasets demonstrate consistent improvements: equal error rates (EER) decrease by 18.7%–26.3% over strong baselines. The framework is effective across diverse speaker embedding models and does not depend on a particular architecture.
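Since the results are reported as equal error rate (EER), a minimal sketch of how that metric can be computed from verification scores may help. The function name and the toy scores are illustrative assumptions, not taken from the paper; the threshold sweep is a simple approximation that picks the operating point where false acceptance and false rejection rates are closest.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores:    scores for same-speaker trials (should be high)
    nontarget_scores: scores for different-speaker trials (should be low)
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        # False rejection rate: same-speaker trials that fall below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR at the closest crossing point.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one same-speaker trial scores below one impostor trial.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))  # 0.25
```

Evaluation toolkits usually interpolate between thresholds rather than picking the nearest one, so values computed this way can differ slightly on small trial lists.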

📝 Abstract

Speaker Verification (SV) systems trained on adult speech often underperform on children's speech because of the acoustic mismatch, and the limited amount of children's speech data makes fine-tuning less effective. In this paper, we propose a framework, a Gated Linear Unit adapter with Iterative Fine-Tuning (G-IFT), to improve the efficiency of knowledge transfer between the high-resource adult speech domain and the low-resource children's speech domain. In this framework, a Gated Linear Unit adapter is first inserted between the pre-trained speaker embedding model and the classifier. The classifier, adapter, and pre-trained speaker embedding model are then optimized sequentially and iteratively. The framework is agnostic to the underlying architecture of the SV system. Our experiments with ECAPA-TDNN, ResNet, and x-vector architectures on the OGI and MyST datasets show that G-IFT yields consistent reductions in Equal Error Rate compared to baseline methods.
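To make the adapter idea concrete, here is a minimal sketch of a gated linear unit applied to a speaker embedding, using the standard GLU form (a linear projection multiplied elementwise by a sigmoid gate). The abstract does not give the adapter's dimensions, initialization, or any residual connection, so the function names, weights, and sizes below are purely illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(x, weight, bias):
    """Affine map; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def glu_adapter(embedding, w_value, b_value, w_gate, b_gate):
    """Standard GLU: (W x + b) * sigmoid(V x + c), elementwise.

    In the framework described above, this would sit between the
    pre-trained embedding model and the classifier (assumed wiring).
    """
    value = linear(embedding, w_value, b_value)
    gate = [sigmoid(g) for g in linear(embedding, w_gate, b_gate)]
    return [v * g for v, g in zip(value, gate)]


# Hypothetical 2-dimensional embedding with hand-picked weights:
# identity value projection and a zero gate projection (gate = 0.5 everywhere).
embedding = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(glu_adapter(embedding, identity, [0.0, 0.0], zeros, [0.0, 0.0]))  # [0.5, 1.0]
```

The gate lets the adapter scale each dimension of the projected embedding between 0 and 1, which is one plausible way for it to emphasize or suppress features when moving from adult to child speech.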

Problem

Research questions and friction points this paper is trying to address.

Improve children's speaker verification with limited data

Address acoustic mismatch between adult and child speech

Enhance knowledge transfer from adult to child speech domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Linear Unit adapter for knowledge transfer

Iterative fine-tuning of model components

Agnostic to underlying speaker verification architecture
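The iterative fine-tuning idea can be sketched as a schedule that visits the classifier, adapter, and backbone in turn and repeats for several rounds. The component order follows the abstract; freezing the other two components during each stage and the number of rounds are assumptions made for illustration, and the names below are hypothetical.

```python
COMPONENTS = ("classifier", "adapter", "backbone")


def iterative_schedule(num_rounds, order=COMPONENTS):
    """Yield (round, trainable, frozen) for each fine-tuning stage.

    Assumption: only one component is updated per stage while the
    others are held fixed; the source only says the components are
    optimized sequentially in an iterative way.
    """
    for round_idx in range(1, num_rounds + 1):
        for name in order:
            frozen = tuple(c for c in order if c != name)
            yield round_idx, name, frozen


for round_idx, trainable, frozen in iterative_schedule(num_rounds=2):
    # A real training loop would enable gradients only for `trainable`
    # here and run some epochs on the children's speech data.
    print(f"round {round_idx}: train {trainable}, freeze {', '.join(frozen)}")
```

Because the schedule only names components, the same loop applies whichever embedding backbone is used, which matches the architecture-agnostic claim.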

🔎 Similar Papers

Personalized Speech Recognition for Children with Test-Time Adaptation