BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

📅 2025-10-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the lack of developmentally plausible, cognitively interpretable training data for multilingual pretraining. We introduce the first developmentally grounded multilingual benchmark informed by empirical child language acquisition (CLA) trajectories. Methodologically, we combine principles from developmental psychology with multilingual data compilation to construct a staged linguistic-input dataset covering 45 languages, each scaled to the equivalent of 100 million English words, modeling natural language exposure from infancy to native proficiency. We accompany this resource with a standardized evaluation suite and baseline models. Our contributions are threefold: (1) the first cross-lingual, developmentally grounded paradigm for constructing pretraining data; (2) a reproducible resource for cognitive modeling and evaluation of multilingual models; and (3) evaluation suites and baseline models in each language, enabling controlled study of cross-lingual transfer and cognitive alignment.
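As a rough illustration of the per-language budgeting described above, the sketch below converts the 100M-English-word target into per-language word budgets. The ratio values, language selection, and function names are illustrative assumptions, not figures or code from the paper.

```python
# Illustrative sketch only: converts a 100M-English-word budget into
# per-language word budgets. The ratios below are made-up placeholders,
# not the conversion factors used by BabyBabelLM.

ENGLISH_BUDGET_WORDS = 100_000_000

# Assumed "words needed per English word" ratios for a few languages
# (ISO 639-3 codes); purely hypothetical values for demonstration.
WORD_RATIOS = {
    "nld": 0.95,  # Dutch
    "deu": 0.90,  # German
    "eus": 0.80,  # Basque
    "ukr": 0.85,  # Ukrainian
}

def target_budget(lang: str) -> int:
    """Word budget for `lang` that roughly matches 100M English words."""
    return round(ENGLISH_BUDGET_WORDS * WORD_RATIOS[lang])

if __name__ == "__main__":
    for lang in WORD_RATIOS:
        print(f"{lang}: {target_budget(lang):,} target words")
```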

📝 Abstract
We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
Problem

Research questions and friction points this paper is trying to address.

Models the language input a learner observes from birth to native proficiency
Provides multilingual, developmentally plausible training datasets
Facilitates cognitive modeling and multilingual pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual, developmentally plausible pretraining data curation
45 languages, each targeting the equivalent of 100M English words
Evaluation suites and baseline models for cognitive modeling (see the loading sketch below)
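A minimal sketch of how one of the released per-language training splits might be loaded and sanity-checked. The dataset identifier, config name, and `text` field are assumptions about how such a release could be hosted, not confirmed BabyBabelLM paths.

```python
# Illustrative sketch only: repo id, config name, and the "text" field
# are hypothetical, not confirmed BabyBabelLM release paths.
from datasets import load_dataset

ds = load_dataset("babylm/babybabellm", name="nld", split="train")  # hypothetical id/config

# Rough sanity check against the ~100M-English-word-equivalent budget
# described in the paper (whitespace tokenization only).
n_words = sum(len(example["text"].split()) for example in ds)
print(f"nld train split: {n_words:,} whitespace-separated words")
```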
Jaap Jumelet
University of Groningen
Natural Language Processing, Explainable AI, Computational Linguistics
Abdellah Fourtassi
Aix Marseille University
Akari Haga
Nara Institute of Science and Technology
Bastian Bunzeck
Bielefeld University
language acquisition, usage-based linguistics, developmental AI, connectionism, CxG
Bhargav Shandilya
University of Colorado Boulder
Computational Linguistics
Diana Galvan-Sosa
Research Associate, University of Cambridge
Educational NLP, Information Extraction, Clinical NLP, Machine Reading Comprehension
Faiz Ghifari Haznitrama
Ph.D. Student, KAIST School of Computing
NLP, Low-Resource NLP
Francesca Padovani
University of Groningen
Francois Meyer
PhD student, University of Cape Town
Natural Language Processing, Machine Learning
Hai Hu
City University of Hong Kong
computational linguistics, natural language inference, Chinese linguistics, corpus annotation
Julen Etxaniz
PhD Student in NLP, HiTZ, University of the Basque Country
Multilinguality, NLP, DL, ML, AI
Laurent Prévot
Professor, Laboratoire Parole et Langage, CNRS & Aix Marseille Univ (France)
Discourse, Dialogue, Natural Language Processing, Computational Linguistics, Language Resources
Linyang He
Columbia University
María Grandury
SomosNLP / Polytechnical University of Madrid
Natural Language Processing, LLM Evaluation
Mila Marcheva
University of Cambridge
Negar Foroutan
EPFL
Nikitas Theodoropoulos
Independent Researcher
Pouya Sadeghi
Computer Science student, University of Waterloo
Siyuan Song
Associate Professor, Arizona State University
Construction Safety and Health, Workforce Development, AI in Construction, Engineering Education
Suchir Salhan
University of Cambridge
Machine Learning, Language Models, Natural Language Processing, Linguistics, Cognitive Science
Susana Zhou
SomosNLP
Yurii Paniv
CompSci PhD Student, Ukrainian Catholic University
data-efficient training, large language models, machine translation, natural language processing
Ziyin Zhang
Shanghai Jiao Tong University
Artificial Intelligence, Natural Language Processing, Large Language Models
Arianna Bisazza
Associate Professor, University of Groningen
Natural Language Processing, Multilingual NLP, Interpretability, Language Learning in Humans vs Machines
Alex Warstadt
Assistant Professor, UC San Diego
Computational linguistics, Natural language processing, Pragmatics, Language acquisition