ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

📅 2026-05-27
🤖 AI Summary
This work addresses the lack of a systematic benchmark for the diverse acoustic signals that children produce from birth through school age, which range from physiological sounds and non-linguistic vocalizations to canonical syllables and spoken language. To bridge this gap, we introduce ChildVox, a multi-task audio benchmark that spans this developmental trajectory and integrates more than 20 sub-tasks across 17 child-centered datasets, enabling unified cross-corpus and cross-domain evaluation. We evaluate self-supervised speech models, automatic speech recognition (ASR) systems, and large audio-language models on tasks such as physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. The results yield a suite of high-performing models for recognizing a wide range of child acoustic signals, supporting downstream applications such as assessing children's language levels and tracking speech production with age.
📝 Abstract
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models (self-supervised, ASR-oriented, and large audio-language models) on tasks that include physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
Problem

Research questions and friction points this paper is trying to address.

child speech
audio benchmark
developmental acoustics
vocalization characterization
speech modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChildVox
audio-language model
developmental speech benchmark
child vocalization
cross-corpus evaluation