ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This work addresses the lack of a systematic benchmark for comprehensively understanding the diverse acoustic signals—ranging from physiological sounds and non-linguistic vocalizations to canonical syllables and spoken language—produced by children from birth through school age. To bridge this gap, we introduce ChildVox, the first multi-task audio benchmark spanning the entire developmental trajectory of childhood, integrating 17 datasets and over 20 subtasks to enable unified cross-corpus and cross-domain evaluation. By systematically evaluating self-supervised speech models, automatic speech recognition (ASR) systems, and large audio-language models on tasks such as physiological sound classification, vocalization modeling, and syllable recognition, we identify optimal model configurations that provide high-precision technical support for assessing child language development and tracking articulatory progression.

📝 Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Problem

Research questions and friction points this paper is trying to address.

child speech

audio benchmark

developmental acoustics

vocalization characterization

speech modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

ChildVox

audio-language model

developmental speech benchmark

child vocalization

cross-corpus evaluation

🔎 Similar Papers

Personalized Speech Recognition for Children with Test-Time Adaptation

2024-09-19arXiv.orgCitations: 0