🤖 AI Summary
This study investigates how utterance-level construction distributions in German child-directed speech (for example, the share of simple versus complex sentences and of full utterances versus fragments) affect the formal linguistic competence and learning trajectories of small language models (German BabyLMs). Method: The authors train small models on a novel collection of developmentally plausible German data in which the distribution of constructions is varied, and evaluate both learning trajectories and final accuracy. Contribution/Results: Learning trajectories are surprisingly robust: markedly different construction distributions have little effect on final accuracies and almost no effect on global learning trajectories. Within that overall robustness, syntax learning benefits from more complex utterances, whereas lexical learning reaches better scores with more fragmentary data. The authors argue that language models trained on developmentally plausible data can contribute to debates on how rich or impoverished linguistic stimuli actually are.
📝 Abstract
We analyze the influence of utterance-level construction distributions in German child-directed speech on the resulting formal linguistic competence and the underlying learning trajectories for small language models trained on a novel collection of developmentally plausible language data for German. We find that trajectories are surprisingly robust for markedly different distributions of constructions in the training data, which have little effect on final accuracies and almost no effect on global learning trajectories. While syntax learning benefits from more complex utterances, lexical learning culminates in better scores with more fragmentary data. We argue that LMs trained on developmentally plausible data can contribute to debates on how rich or impoverished linguistic stimuli actually are.