🤖 AI Summary
This work addresses the lack of a systematic benchmark for the diverse acoustic signals that children produce from birth through school age, ranging from physiological sounds and non-linguistic vocalizations to canonical syllables and spoken language. To bridge this gap, we introduce ChildVox, a multi-task audio benchmark spanning the full developmental trajectory of childhood. It integrates 17 datasets and more than 20 sub-tasks to enable systematic cross-corpus and cross-domain evaluation. We evaluate self-supervised speech models, automatic speech recognition (ASR) oriented models, and large audio-language models on tasks such as physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. The results identify models that perform well across these signal types, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
📝 Abstract
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performing models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.