Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether multilingual input leads to language development delays and how the structure of such input affects acquisition outcomes. To overcome the ethical and practical constraints of randomly assigning language environments in human studies, the authors simulate child bilingual acquisition using small GPT-2–based language models trained on strictly controlled synthetic monolingual and bilingual corpora (100 million words each). Model performance is evaluated across multiple dimensions, including perplexity, grammaticality, and semantic knowledge, to assess the impact of different exposure patterns. Results indicate that the bilingual models achieve robust performance in both languages without significant learning delays, and that no substantial differences emerge across varying bilingual input configurations. These findings support the hypothesis that statistical learning mechanisms can efficiently process and integrate multilingual input.
📝 Abstract
Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.
🧩 Problem

Research questions and friction points this paper addresses:

multilingualism
language acquisition
bilingual exposure
language learning delays
input structure
💡 Innovation

Methods, ideas, and system contributions that make the work stand out:
bilingual language acquisition
controlled synthetic data
small-scale language models
multilingual exposure regimes
statistical learning