Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a phase-specific "over-memorization" phenomenon in finetuning large language models (LLMs) for reasoning tasks: during certain training stages, models maintain high test accuracy while exhibiting anomalously elevated test perplexity alongside sharply degraded out-of-distribution generalization, robustness, and generation diversity. Through systematic analysis of learning dynamics across multiple tasks, models (e.g., LLaMA, Qwen), and finetuning paradigms (LoRA and full-parameter tuning), the authors formally define the phenomenon and empirically establish its prevalence. They attribute over-memorization to local overfitting of over-parameterized models to the training data, a failure mode invisible to conventional accuracy metrics, and propose mitigation strategies grounded in checkpoint selection and learning-rate scheduling. Contributions include the first formal characterization of over-memorization in LLM reasoning finetuning, empirical validation across diverse settings, and practical diagnostic tools and guidelines for robust LLM adaptation.

📝 Abstract
Pretrained large language models (LLMs) are finetuned with labeled data for better instruction-following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously unreported over-memorization phenomenon that arises during a specific stage of finetuning. At this stage, the LLMs have excessively memorized the training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to over-memorization and find that excessive training epochs and large learning rates contribute to this issue. Although over-memorized models achieve test accuracy comparable to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments show that over-memorization occurs broadly across different tasks, models, and finetuning methods. Our research highlights that over-parameterized, extensively finetuned LLMs exhibit learning dynamics distinct from those of traditional machine learning models. Based on our observations of over-memorization, we provide recommendations for checkpoint and learning rate selection during finetuning.
Problem

Research questions and friction points this paper is trying to address.

Study over-memorization in LLMs during finetuning
Identify causes like epochs and learning rates
Assess impacts on robustness and generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Study LLM finetuning dynamics on reasoning tasks
Identify over-memorization in specific finetuning stage
Recommend checkpoint and learning rate selection
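The checkpoint-selection recommendation can be sketched as a simple rule: monitor held-out perplexity alongside accuracy, and reject high-accuracy checkpoints whose perplexity has spiked, since that combination is the over-memorization signature the paper describes. The function name, data layout, and `ppl_slack` tolerance below are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical perplexity-aware checkpoint selection. Over-memorized
# checkpoints keep good test accuracy but show elevated test perplexity,
# so accuracy-only selection can silently pick them.

def select_checkpoint(checkpoints, ppl_slack=1.1):
    """Return the highest-accuracy checkpoint among those whose test
    perplexity stays within `ppl_slack` times the minimum perplexity
    observed across all checkpoints.

    checkpoints: list of dicts with keys "step", "accuracy", "perplexity".
    """
    min_ppl = min(c["perplexity"] for c in checkpoints)
    # Discard checkpoints whose perplexity has spiked relative to the best:
    # high accuracy plus high perplexity marks over-memorization.
    healthy = [c for c in checkpoints if c["perplexity"] <= ppl_slack * min_ppl]
    return max(healthy, key=lambda c: c["accuracy"])


# Toy training history (invented numbers, for illustration only).
history = [
    {"step": 100, "accuracy": 0.62, "perplexity": 3.0},
    {"step": 200, "accuracy": 0.71, "perplexity": 2.8},
    {"step": 300, "accuracy": 0.73, "perplexity": 2.9},
    {"step": 400, "accuracy": 0.74, "perplexity": 5.6},  # accuracy flat, ppl spikes
]
best = select_checkpoint(history)
print(best["step"])  # picks step 300; accuracy alone would pick step 400
```

Accuracy-only selection would favor the final checkpoint here, whereas the perplexity filter recovers the earlier, better-generalizing one.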
Zhiwen Ruan, Southern University of Science and Technology
Yun Chen, Shanghai University of Finance and Economics
Yutao Hou, Shanghai University of Finance and Economics
Peng Li, Tsinghua University
Yang Liu, Tsinghua University
Guanhua Chen, Southern University of Science and Technology