🤖 AI Summary
This work identifies a phase-specific “over-memorization” phenomenon in fine-tuning large language models (LLMs) for reasoning tasks: during certain training stages, models maintain high test accuracy while exhibiting sharply degraded out-of-distribution generalization, robustness, and generation diversity, along with anomalously elevated test perplexity. Through systematic analysis of learning dynamics across multiple tasks, models (e.g., LLaMA, Qwen), and fine-tuning paradigms (LoRA, full-parameter tuning), the authors formally define this phenomenon and empirically establish its prevalence. They attribute over-memorization to local overfitting of over-parameterized models to training data, a failure mode invisible to conventional accuracy metrics. To address it, they propose mitigation strategies grounded in checkpoint selection and learning rate scheduling. The contributions include the first formal characterization of over-memorization in LLM reasoning fine-tuning, empirical validation across diverse settings, and practical diagnostic tools and guidelines for robust LLM adaptation.
📝 Abstract
Pretrained large language models (LLMs) are finetuned on labeled data for better instruction-following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously unobserved over-memorization phenomenon that arises during a specific stage of finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to over-memorization and find that extended training epochs and large learning rates contribute to the issue. Although over-memorized models achieve test accuracy comparable to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments show that over-memorization occurs broadly across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit learning dynamics distinct from those of traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.
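The diagnostic signature described above (test accuracy holding steady while test perplexity spikes) can be sketched as a simple checkpoint filter. This is a minimal illustration, not the authors' released tooling; the threshold values (`acc_tol`, `ppl_ratio`) and the checkpoint record format are hypothetical assumptions for the sketch.

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a sequence: exp of the negative mean
    per-token log-probability (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def flag_over_memorization(checkpoints, acc_tol=0.02, ppl_ratio=1.5):
    """Flag checkpoints whose test accuracy is within `acc_tol` of the
    best observed accuracy but whose test perplexity has risen to at
    least `ppl_ratio` times the minimum observed perplexity -- the
    'high perplexity, good accuracy' signature of over-memorization.
    Each checkpoint is a dict with 'step', 'accuracy', 'perplexity'.
    Thresholds here are illustrative, not values from the paper."""
    best_acc = max(c["accuracy"] for c in checkpoints)
    min_ppl = min(c["perplexity"] for c in checkpoints)
    return [
        c["step"]
        for c in checkpoints
        if c["accuracy"] >= best_acc - acc_tol
        and c["perplexity"] >= ppl_ratio * min_ppl
    ]
```

Under this sketch, checkpoint selection would simply prefer the earliest checkpoint that reaches near-best accuracy without being flagged.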