Beyond Shallow Heuristics: Leveraging Human Intuition for Curriculum Learning

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work asks how to design effective curriculum learning strategies for language model pretraining, using human intuition about linguistic difficulty as the ordering signal. It builds label-based curricula from the human-curated, article-level "simple" labels of the Simple Wikipedia corpus and evaluates them on a BERT-tiny model against competence-based curricula driven by shallow heuristics (e.g., sentence length, word frequency). Adding simple data alone yields no clear benefit, but structuring it via a curriculum, especially with the simple data introduced first, consistently improves perplexity, particularly on simple language. The competence-based curricula show no consistent gains over random ordering, probably because the heuristics fail to separate the two classes effectively. The results suggest that human intuition about linguistic difficulty is a usable, interpretable signal for curriculum design in language model pretraining.

📝 Abstract
Curriculum learning (CL) aims to improve training by presenting data from "easy" to "hard", yet defining and measuring linguistic difficulty remains an open challenge. We investigate whether human-curated simple language can serve as an effective signal for CL. Using the article-level labels from the Simple Wikipedia corpus, we compare label-based curricula to competence-based strategies relying on shallow heuristics. Our experiments with a BERT-tiny model show that adding simple data alone yields no clear benefit. However, structuring it via a curriculum -- especially when introduced first -- consistently improves perplexity, particularly on simple language. In contrast, competence-based curricula lead to no consistent gains over random ordering, probably because they fail to effectively separate the two classes. Our results suggest that human intuition about linguistic difficulty can guide CL for language model pre-training.
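The label-based ordering described in the abstract can be illustrated with a minimal sketch. It assumes each training example carries an article-level label, "simple" or "normal". The function names, the label strings, and the two-phase layout are hypothetical choices for illustration, not the authors' implementation.

```python
import random
from typing import List, Tuple

# Hypothetical example format: (text, label), label is "simple" or "normal".
Example = Tuple[str, str]


def label_curriculum(
    examples: List[Example], simple_first: bool = True, seed: int = 0
) -> List[Example]:
    """Order examples in two phases based on their human-assigned label.

    Within each phase the order is shuffled, so the only structure imposed
    is which class the model sees first (an assumption of this sketch).
    """
    rng = random.Random(seed)
    simple = [e for e in examples if e[1] == "simple"]
    normal = [e for e in examples if e[1] != "simple"]
    rng.shuffle(simple)
    rng.shuffle(normal)
    return simple + normal if simple_first else normal + simple


def random_order(examples: List[Example], seed: int = 0) -> List[Example]:
    """Baseline: the same data in a fully shuffled order."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    corpus = [
        ("The cat sat on the mat.", "simple"),
        ("Feline quadrupeds habitually occupy textile surfaces.", "normal"),
        ("Dogs like to play.", "simple"),
        ("Canine play behaviour serves several social functions.", "normal"),
    ]
    for text, label in label_curriculum(corpus):
        print(label, "|", text)
```

Both functions return the same examples and differ only in order, which mirrors the comparison in the abstract between adding simple data and structuring it via a curriculum.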
Problem

Research questions and friction points this paper is trying to address.

Defining and measuring linguistic difficulty in curriculum learning
Investigating human-curated simple language as signal for curriculum
Comparing label-based curricula with shallow heuristic strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-curated simple language for curriculum learning
Label-based curricula outperform shallow heuristics
Human intuition guides effective linguistic difficulty ordering
Vanessa Toborek
University of Bonn, Lamarr Institute
Sebastian Müller
University of Bonn, Lamarr Institute
Tim Selbach
University of Bonn
Tamás Horváth
University of Bonn, Lamarr Institute, Fraunhofer IAIS
Christian Bauckhage
Prof. of Computer Science, University of Bonn, Fraunhofer IAIS, LAMARR Institute
Pattern Recognition, Machine Learning, Quantum Computing, Computer Games