Do Construction Distributions Shape Formal Language Learning In German BabyLMs?

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how utterance-level construction distributions in German child-directed speech, specifically the ratio of simple to complex sentences and of full utterances to fragments, affect the acquisition of formal syntactic and lexical knowledge in small language models (German BabyLMs). Method: The authors build a collection of developmentally plausible German training data, vary the frequencies of construction types in it, and evaluate models on two dimensions: learning trajectories and final accuracy. Contribution/Results: Learning trajectories are robust across markedly different construction distributions, which have little effect on final accuracies and almost no effect on global learning trajectories. Where differences do appear, syntax learning benefits from more complex utterances, while lexical learning reaches better scores with more fragmentary data. The authors argue that language models trained on developmentally plausible data can inform debates on how rich or impoverished linguistic stimuli actually are.
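One way to picture "varying the frequencies of construction types" is resampling a labeled corpus to a target distribution. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: the function name `resample_by_construction`, the labels `fragment`, `simple`, and `complex`, and the choice of sampling with replacement are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_by_construction(utterances, target_shares, total, seed=0):
    """Draw about `total` utterances whose labels follow `target_shares`.

    utterances:    list of (text, label) pairs (labels are hypothetical).
    target_shares: dict mapping label -> desired share; must sum to 1.
    Sampling is with replacement, so a rare construction can be upsampled.
    Per-label counts are rounded, so the result may differ from `total`
    by a few items when the shares do not divide evenly.
    """
    if abs(sum(target_shares.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to 1")

    # Group utterance texts by their construction label.
    by_label = defaultdict(list)
    for text, label in utterances:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = []
    for label, share in target_shares.items():
        pool = by_label.get(label)
        if share > 0 and not pool:
            raise ValueError(f"no utterances available for label {label!r}")
        count = round(share * total)
        sample.extend((rng.choice(pool), label) for _ in range(count))

    rng.shuffle(sample)  # avoid blocks of identical construction types
    return sample


# Toy corpus; the sentences and labels are invented for illustration.
toy = [
    ("Ball!", "fragment"),
    ("Da ist der Ball.", "simple"),
    ("Ich glaube, dass der Ball rollt.", "complex"),
]
shares = {"fragment": 0.1, "simple": 0.3, "complex": 0.6}
resampled = resample_by_construction(toy, shares, total=100)
print(Counter(label for _, label in resampled))
```

Two corpora built this way with different `target_shares` would differ only in their construction mix, which is the kind of contrast the summary describes. A real setup would also need to control total token count, since complex utterances are longer than fragments.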

📝 Abstract

We analyze the influence of utterance-level construction distributions in German child-directed speech on the resulting formal linguistic competence and the underlying learning trajectories for small language models trained on a novel collection of developmentally plausible language data for German. We find that trajectories are surprisingly robust for markedly different distributions of constructions in the training data, which have little effect on final accuracies and almost no effect on global learning trajectories. While syntax learning benefits from more complex utterances, lexical learning culminates in better scores with more fragmentary data. We argue that LMs trained on developmentally plausible data can contribute to debates on how rich or impoverished linguistic stimuli actually are.

Problem

Research questions and friction points this paper is trying to address.

Influence of construction distributions on German BabyLMs

Effect of training data on syntax and lexical learning

Role of developmentally plausible data in language model training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes construction distributions in German child-directed speech

Trains small language models on developmentally plausible data

Explores syntax and lexical learning with varied utterance complexity

🔎 Similar Papers

Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs