🤖 AI Summary
Existing speech models perform poorly on long-form child-centered recordings, primarily because they are trained on clean, adult-speech data that fails to capture the acoustic and linguistic variability of children's speech and environments. To address this, we propose the first child-centric, multilingual self-supervised approach to speaker segmentation. Our method pretrains an enhanced HuBERT architecture on over 13,000 hours of real-world, multilingual child-centered speech spanning more than 40 languages. Key contributions include: (i) the first large-scale, in-the-wild, multilingual foundation model for child speech; and (ii) a downstream speaker-segmentation task tailored to distinguish the target child from female adults, male adults, and other children. Evaluated across six diverse datasets, the method achieves F1 scores of 52.1%-74.4%. Notably, it outperforms standard HuBERT by 13.2 and 15.9 absolute F1 points on Vanuatu and Solomon Islands corpora, respectively, marking clear progress in child speech analysis for low-resource languages.
📝 Abstract
Child-centered long-form recordings are essential for studying early language development, but existing speech models, trained on clean adult data, perform poorly on them due to acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation model trained on 13,000 hours of multilingual child-centered long-form recordings spanning over 40 languages. We evaluate BabyHuBERT on speaker segmentation: identifying when the target child speaks versus female adults, male adults, or other children, a fundamental preprocessing step for analyzing naturalistic language experiences. BabyHuBERT achieves F1 scores from 52.1% to 74.4% across six diverse datasets, consistently outperforming W2V2-LL4300 (trained on English long-form recordings) and standard HuBERT (trained on clean adult speech). Notable gains include 13.2 absolute F1 points over HuBERT on the Vanuatu corpus and 15.9 points on the Solomon Islands corpus, demonstrating effectiveness on underrepresented languages. By sharing code and models, BabyHuBERT serves as a foundation model for child speech research, enabling fine-tuning on diverse downstream tasks.
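The four-way speaker segmentation task described above is commonly scored with per-class F1 over aligned frame labels. A minimal sketch of that computation, with illustrative class tags (KCHI = target child, FEM = female adult, MAL = male adult, OCH = other child; the labels and toy data are placeholders, not the paper's exact setup):

```python
# Hypothetical frame-level labels for a short recording excerpt.
reference  = ["KCHI", "KCHI", "FEM", "FEM", "MAL", "OCH", "KCHI", "FEM"]
prediction = ["KCHI", "FEM",  "FEM", "FEM", "MAL", "KCHI", "KCHI", "FEM"]

def per_class_f1(ref, hyp, label):
    """F1 for one speaker class over aligned frame label sequences."""
    tp = sum(r == label and h == label for r, h in zip(ref, hyp))
    fp = sum(r != label and h == label for r, h in zip(ref, hyp))
    fn = sum(r == label and h != label for r, h in zip(ref, hyp))
    if tp == 0:
        return 0.0  # no correct detections for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

for label in ["KCHI", "FEM", "MAL", "OCH"]:
    print(label, round(per_class_f1(reference, prediction, label), 3))
```

Scores reported per class (rather than pooled) make it visible when a model confuses the target child with other children, which is the hard case this work targets.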