🤖 AI Summary
This work identifies a phase-specific “over-memorization” phenomenon in fine-tuning large language models (LLMs) for reasoning tasks: during certain training stages, models maintain high test accuracy while exhibiting sharply degraded out-of-distribution generalization, robustness, and generation diversity, along with anomalously elevated test perplexity. Through systematic analysis of learning dynamics across multiple tasks, models (e.g., LLaMA, Qwen), and fine-tuning paradigms (LoRA, full-parameter tuning), the authors formally define this phenomenon and empirically establish its prevalence. They attribute over-memorization to local overfitting of over-parameterized models to training data, a failure mode invisible to conventional accuracy metrics. To address it, they propose mitigation strategies grounded in checkpoint selection and learning rate scheduling. The contributions include the first formal characterization of over-memorization in LLM reasoning fine-tuning, empirical validation across diverse settings, and practical diagnostic tools and guidelines for robust LLM adaptation.
📝 Abstract
Pretrained large language models (LLMs) are finetuned on labeled data for better instruction-following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously unobserved over-memorization phenomenon that arises during a specific stage of finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to over-memorization and find that extended training epochs and large learning rates contribute to the issue. Although over-memorized models achieve test accuracy comparable to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments show that over-memorization occurs broadly across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit learning dynamics distinct from those of traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.
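The diagnostic signature described above (test accuracy holding steady while test perplexity spikes) can be sketched as a simple checkpoint filter. This is a minimal illustration, not the authors' released tooling; the threshold values (`acc_tol`, `ppl_ratio`) and the checkpoint record format are hypothetical assumptions for the sketch.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a sequence: exp of the negative mean
    per-token log-probability (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def flag_over_memorization(checkpoints, acc_tol=0.02, ppl_ratio=1.5):
    """Flag checkpoints whose test accuracy is within `acc_tol` of the
    best observed accuracy but whose test perplexity has risen to at
    least `ppl_ratio` times the minimum observed perplexity -- the
    'high perplexity, good accuracy' signature of over-memorization.
    Each checkpoint is a dict with 'step', 'accuracy', 'perplexity'.
    Thresholds here are illustrative, not values from the paper."""
    best_acc = max(c["accuracy"] for c in checkpoints)
    min_ppl = min(c["perplexity"] for c in checkpoints)
    return [
        c["step"]
        for c in checkpoints
        if c["accuracy"] >= best_acc - acc_tol
        and c["perplexity"] >= ppl_ratio * min_ppl
    ]
```

Under this sketch, checkpoint selection would simply prefer the earliest checkpoint that reaches near-best accuracy without being flagged.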