🤖 AI Summary
Standard supervised fine-tuning (SFT) of language models saturates on rigorous mathematical reasoning benchmarks (e.g., MATH). Method: This paper proposes STAT, a skill-targeted adaptive training framework that leverages the metacognitive abilities of a stronger teacher model. STAT automatically decomposes tasks into fine-grained skills, identifies the student model's deficiencies at the skill level, and constructs a "Missing-Skill-Profile." It then introduces two adaptive mechanisms: dynamic sample reweighting (STAT-Sel) and skill-guided data synthesis (STAT-Syn). STAT integrates seamlessly into both SFT and GRPO-based multi-stage optimization pipelines. Contribution/Results: Experiments on Llama and Qwen models demonstrate absolute accuracy gains of up to +7.5% on MATH and an average improvement of +4.6% on out-of-distribution benchmarks, including AIME24/25 and AMC23, significantly outperforming baselines. These results validate the effectiveness and generalizability of skill-targeted adaptive training.
📝 Abstract
Language models often show little to no improvement (i.e., "saturation") when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often the student failed to apply each skill in its responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines. Our code is available at: https://github.com/princeton-pli/STAT.
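The Missing-Skill-Profile and STAT-Sel-style reweighting described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the skill names, the failure rates, and the "weight by the highest failure rate among an example's required skills" rule are all assumptions made for the sake of the example.

```python
# Hypothetical teacher-assigned skill labels for a few training examples.
examples = [
    {"id": 0, "skills": ["algebraic_manipulation", "factoring"]},
    {"id": 1, "skills": ["modular_arithmetic"]},
    {"id": 2, "skills": ["geometry_angles", "algebraic_manipulation"]},
]

# Missing-Skill-Profile: for each skill, the fraction of the student's
# responses in which it failed to apply that skill (illustrative values).
missing_skill_profile = {
    "algebraic_manipulation": 0.1,
    "factoring": 0.6,
    "modular_arithmetic": 0.8,
    "geometry_angles": 0.3,
}

def sample_weight(example, profile, base=1.0):
    """Upweight an example by the worst failure rate among its skills."""
    rates = [profile.get(skill, 0.0) for skill in example["skills"]]
    return base + max(rates, default=0.0)

weights = [sample_weight(e, missing_skill_profile) for e in examples]

# Normalize into a sampling distribution over the training set, so that
# examples exercising weaker skills are drawn more often during SFT.
total = sum(weights)
probs = [w / total for w in weights]
```

Here example 1 (modular arithmetic, the weakest skill) receives the highest sampling probability, which is the intended behavior of reweighting toward missing skills.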