Enhancing Age-Related Robustness in Children Speaker Verification

📅 2025-02-14
🤖 AI Summary
Child speaker verification (C-SV) faces a fundamental challenge: children's voices change rapidly with physiological growth, which severely degrades cross-age robustness. To address this, we propose the Feature Transform Adapter (FTA), a module that integrates local patterns into higher-level global representations and thereby reduces overfitting to specific local features. We further employ Synthetic Audio Augmentation (SAA) to increase data diversity and size, improving robustness to age-related changes. Additionally, we introduce C-Longitudinal, a longitudinal child speech dataset designed for cross-year evaluation. Combining FTA and SAA reduces the average equal error rate (EER) by 19.4%, 13.0%, and 6.1% on the one-, two-, and three-year gap evaluation sets, respectively, compared to the baseline.

📝 Abstract
One of the main challenges in children's speaker verification (C-SV) is the significant change in children's voices as they grow. In this paper, we propose two approaches to improve age-related robustness in C-SV. We first introduce a Feature Transform Adapter (FTA) module that integrates local patterns into higher-level global representations, reducing overfitting to specific local features and improving the inter-year SV performance of the system. We then employ Synthetic Audio Augmentation (SAA) to increase data diversity and size, thereby improving robustness against age-related changes. Since the lack of longitudinal speech datasets makes it difficult to measure age-related robustness of C-SV systems, we introduce a longitudinal dataset to assess inter-year verification robustness of C-SV systems. By integrating both of our proposed methods, the average equal error rate was reduced by 19.4%, 13.0%, and 6.1% in the one-year, two-year, and three-year gap inter-year evaluation sets, respectively, compared to the baseline.
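The results above are reported as reductions in equal error rate (EER), the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how an EER can be computed from verification trial scores; it is not the authors' evaluation code. The function names `compute_eer` and `relative_reduction` are hypothetical, and treating the reported percentages as relative reductions is an assumption, since the abstract does not say whether they are relative or absolute.

```python
import numpy as np


def compute_eer(scores, labels):
    """Approximate the equal error rate from trial scores.

    scores: similarity scores, higher means "same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Assumes both classes are present. Returns a fraction in [0, 1].
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    nontarget = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A trial is accepted when its score is >= the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(target < t).mean() for t in thresholds])      # false rejects

    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent (an assumed reading of the reported gains)."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

For an inter-year evaluation, the same function would be applied separately to the trial lists for each age gap (one-year, two-year, three-year), with enrollment and test utterances drawn from recordings made that many years apart.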
Problem

Research questions and friction points this paper is trying to address.

Improve the robustness of children's speaker verification across age gaps
Cope with the changes in children's voices as they grow
Increase the diversity and size of the training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

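Feature Transform Adapter (FTA) that integrates local patterns into higher-level global representations
Synthetic Audio Augmentation (SAA) to increase data diversity and size
Longitudinal dataset for assessing inter-year verification robustness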