🤖 AI Summary
This study investigates whether syntactic category information can enhance the effectiveness of developmentally motivated curriculum learning in language model training and elucidates the explanatory role of syntactic knowledge in linguistic competence.
Method: Leveraging the BabyLM and CHILDES corpora, we systematically characterize syntactic distribution patterns across developmental stages of child language acquisition, and propose a cognitively inspired, syntax-aware curriculum strategy that dynamically selects training subsets based on syntactic complexity and token frequency.
Contribution/Results: Experiments demonstrate that syntax-classifiable subsets, contrasted with noisy full-data baselines, yield consistent improvements on downstream tasks such as reading comprehension (+3.2% average gain). Moreover, the curriculum progression closely aligns with empirically observed human language development trajectories. To our knowledge, this is the first work to integrate fine-grained syntactic structure modeling into a developmental curriculum learning framework, establishing a paradigm for interpretable, cognitively grounded language model training.
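The subset-selection idea described above can be sketched as a toy difficulty ordering. This is a minimal illustration, not the paper's actual method: the function name and the two proxies (mean token rarity and sentence length as stand-ins for token frequency and syntactic complexity) are hypothetical.

```python
from collections import Counter

def curriculum_order(sentences, n_stages=3):
    """Toy sketch: order sentences from 'easy' to 'hard' using two
    proxies -- mean token rarity and sentence length -- then split
    the ranked data into successive curriculum stages.
    (Hypothetical stand-ins for the paper's syntactic-complexity
    and token-frequency measures.)"""
    tokens = [s.lower().split() for s in sentences]
    freq = Counter(t for toks in tokens for t in toks)

    def difficulty(toks):
        # rarer words and longer sentences => harder
        rarity = sum(1.0 / freq[t] for t in toks) / len(toks)
        return rarity * len(toks)

    ranked = sorted(tokens, key=difficulty)
    stage_size = -(-len(ranked) // n_stages)  # ceiling division
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]
```

A real implementation would replace these proxies with parser-derived syntactic categories and corpus-level frequency statistics, but the staging logic (rank, then partition into training phases) is the same shape.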
📝 Abstract
We examine the syntactic properties of the BabyLM corpus and of age groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore a developmental curriculum and several alternative cognitively inspired approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data rather than the full noisy corpus.