What Makes a Good Curriculum? Disentangling the Effects of Data Ordering on LLM Mathematical Reasoning

📅 2025-10-21
🤖 AI Summary
This work investigates how curriculum learning (ordering training data by difficulty) affects the mathematical reasoning capabilities of large language models (LLMs), addressing three core questions: when is curriculum learning effective, is forward (easy-to-hard) or reverse (hard-to-easy) sequencing superior, and does efficacy depend on the difficulty metric employed? Method: the authors propose a five-dimensional difficulty decomposition framework, comprising intrinsic problem difficulty, model surprisal, confidence margin, predictive uncertainty, and decision variability, and conduct controlled post-training and offline evaluation across multiple state-of-the-art LLMs. Results: no universally optimal curriculum strategy exists; effectiveness is highly contingent on both model capability and task requirements. Task-aligned curricula (e.g., easy-to-hard for reasoning) improve final performance and generalization, whereas curricula grounded in internal model states, particularly predictive uncertainty, significantly enhance confidence calibration and robustness.

📝 Abstract
Curriculum learning (CL) - ordering training data from easy to hard - has become a popular strategy for improving reasoning in large language models (LLMs). Yet prior work employs disparate difficulty metrics and training setups, leaving open fundamental questions: When does curriculum help? Which direction - forward or reverse - is better? And does the answer depend on what we measure? We address these questions through a unified offline evaluation framework that decomposes curriculum difficulty into five complementary dimensions: Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, and Decision Variability. Through controlled post-training experiments on mathematical reasoning benchmarks with Llama3.1-8B, Mistral-7B, and Gemma3-4B, we find that (i) no curriculum strategy dominates universally - the relative effectiveness of forward versus reverse CL depends jointly on model capability and task complexity; (ii) even within a single metric, samples at different difficulty levels produce distinct gains depending on task demands; and (iii) task-aligned curricula focus on shaping the model's final representations and generalization, whereas inner-state curricula modulate internal states such as confidence and uncertainty. Our findings challenge the notion of a universal curriculum strategy and offer actionable guidance across model and task regimes, with some metrics indicating that prioritizing decision-uncertain samples can further enhance learning outcomes.
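The paper does not publish formulas for its five difficulty dimensions, but four of them can be illustrated with standard definitions from the literature. The sketch below is an assumption about how such metrics might be computed from a model's answer distribution and repeated samples: surprisal as negative log-likelihood of the reference answer, confidence margin as the top-1/top-2 probability gap, predictive uncertainty as Shannon entropy, and decision variability as disagreement with the modal sampled answer. All function names are illustrative, not from the paper.

```python
import math

def surprisal(p_correct):
    # Model Surprisal (assumed definition): negative log-likelihood
    # the model assigns to the reference answer.
    return -math.log(p_correct)

def confidence_margin(probs):
    # Confidence Margin (assumed definition): gap between the
    # top-1 and top-2 answer probabilities.
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def predictive_uncertainty(probs):
    # Predictive Uncertainty (assumed definition): Shannon entropy
    # of the model's answer distribution, in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)

def decision_variability(sampled_answers):
    # Decision Variability (assumed definition): fraction of sampled
    # completions that disagree with the most frequent answer.
    counts = {}
    for a in sampled_answers:
        counts[a] = counts.get(a, 0) + 1
    return 1.0 - max(counts.values()) / len(sampled_answers)
```

A confident model has low surprisal, a large margin, low entropy, and low variability; a curriculum can then rank training samples by any one of these scores.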
Problem

Research questions and friction points this paper is trying to address.

Evaluating how data ordering affects LLM mathematical reasoning performance
Determining optimal curriculum direction based on model capability and task complexity
Identifying which difficulty metrics best enhance learning outcomes in curricula
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates curriculum learning via five difficulty dimensions
Tests forward versus reverse strategies across model capabilities
Prioritizes decision-uncertain samples to enhance learning outcomes
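Mechanically, forward versus reverse curricula reduce to sorting the training set by a chosen difficulty score in ascending or descending order. A minimal sketch, assuming each sample already carries a precomputed difficulty score (the field name `difficulty` and the helper `build_curriculum` are illustrative, not from the paper):

```python
def build_curriculum(samples, score_fn, reverse=False):
    # Forward CL: ascending difficulty (easy-to-hard).
    # Reverse CL: descending difficulty (hard-to-easy).
    return sorted(samples, key=score_fn, reverse=reverse)

data = [
    {"q": "2 + 2", "difficulty": 0.1},
    {"q": "evaluate the integral", "difficulty": 0.9},
    {"q": "add the fractions", "difficulty": 0.4},
]

forward = build_curriculum(data, lambda s: s["difficulty"])
reverse = build_curriculum(data, lambda s: s["difficulty"], reverse=True)
```

Swapping in one of the inner-state metrics (e.g., decision variability) as `score_fn` yields the uncertainty-prioritized ordering the paper highlights.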