🤖 AI Summary
This study addresses why modern language models require orders of magnitude more data than children are exposed to in order to learn effectively. For the first time, small language models are systematically trained on naturalistic, child-scale input—specifically, transcripts from the BabyView dataset, which captures the everyday language environments of children ages 6 to 36 months—to investigate how linguistic knowledge emerges from human-scale data. By evaluating model performance on syntactic, semantic, and world knowledge tasks and assessing input quality with linguistic metrics, the work shows that distributional and interactional properties of the input critically influence learning efficiency. Models exhibit reasonable scaling behavior on syntactic tasks but limited gains on higher-order semantic tasks, and performance varies across individual children's language environments. Notably, model word-level likelihoods correlate significantly with children's vocabulary acquisition, offering novel evidence on the parallels and divergences between human language learning and artificial language modeling.
📝 Abstract
Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.