🤖 AI Summary
This work addresses the challenge of quantifying skill acquisition and forgetting in continual learning for language models. We propose the first developmental psychology–inspired evaluation framework, structuring assessment across five stages aligned with the cognitive development trajectories of children aged 5–10. The framework introduces an interpretable skill graph that models hierarchical dependencies among linguistic and reasoning abilities, together with a large-scale synthetic dataset (23.4B tokens) featuring controlled lexical complexity and diverse formatting. It is the first to systematically integrate human developmental theory into continual-learning evaluation for LLMs, enabling fine-grained analysis of forward transfer, backward transfer, and skill forgetting. Experiments on a 135M-parameter Transformer demonstrate that the framework exposes trade-offs between skill retention and transfer under three distinct training paradigms (task-isolated, joint, and sequential), thereby establishing a reproducible, interpretable, and developmentally grounded benchmark for continual learning.
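The hierarchical skill graph described above can be pictured as a directed acyclic graph in which an edge means one ability is a prerequisite of another. The paper does not publish its graph structure here, so the sketch below uses hypothetical skill names and edges purely to illustrate how such a dependency graph yields a valid learning order via topological sorting.

```python
from collections import defaultdict, deque

def learning_order(edges):
    """Topologically sort skills so every prerequisite precedes its dependents.

    `edges` is a list of (prerequisite, skill) pairs forming a DAG.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for pre, skill in edges:
        graph[pre].append(skill)
        indegree[skill] += 1
        nodes.update((pre, skill))

    # Kahn's algorithm: repeatedly emit skills with no unmet prerequisites.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        skill = queue.popleft()
        order.append(skill)
        for nxt in graph[skill]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("skill graph contains a cycle")
    return order

# Hypothetical dependencies: decoding and vocabulary feed comprehension,
# which in turn feeds inference.
edges = [("letter_sounds", "decoding"),
         ("decoding", "comprehension"),
         ("vocabulary", "comprehension"),
         ("comprehension", "inference")]
print(learning_order(edges))
```

Stage boundaries in such a framework would then correspond to cuts through this DAG, with each stage introducing only skills whose prerequisites appear in earlier stages.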
📝 Abstract
We introduce a comprehensive continual learning dataset and benchmark (CurlL) grounded in human developmental trajectories from ages 5–10, enabling systematic, fine-grained assessment of a model's ability to progressively acquire new skills. CurlL spans five developmental stages (0–4) covering ages 5–10, supported by a skill graph that decomposes broad skills into smaller abilities, concrete goals, and measurable indicators, while also capturing which abilities build on others. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we demonstrate trade-offs between skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluation for language models.
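The forgetting and backward-transfer quantities named in the abstract are not defined in this excerpt; the sketch below uses the standard continual-learning formulation (an accuracy matrix `R[i][j]` giving the score on stage `j` after training through stage `i`), which is an assumption about how CurlL's analysis might be instantiated rather than the paper's exact metric.

```python
def backward_transfer(R):
    """BWT: mean change on earlier stages after the final training stage.

    R[i][j] = score on stage j evaluated after training through stage i.
    Negative values indicate net forgetting of earlier stages.
    """
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

def forgetting(R):
    """Mean drop from each earlier stage's best score to its final score."""
    T = len(R)
    return sum(max(R[i][j] for i in range(j, T)) - R[T - 1][j]
               for j in range(T - 1)) / (T - 1)

# Toy 3-stage run: rows are checkpoints after each stage, columns are stages.
R = [[0.8, 0.1, 0.0],
     [0.7, 0.9, 0.2],
     [0.6, 0.8, 0.9]]
print(backward_transfer(R))  # negative: earlier stages degraded
print(forgetting(R))
```

Under this formulation, the independent, joint, and sequential training setups differ in how `R` is produced: independent training fills only the diagonal, joint training yields a single row, and sequential training yields the full lower-triangular history needed for these metrics.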