Overtrained Language Models Are Harder to Fine-Tune

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a "catastrophic overtraining" phenomenon in large language model (LLM) pre-training: extending pre-training beyond a certain token budget (e.g., growing OLMo-1B's budget from 2.3T to 3T tokens) degrades downstream fine-tuned performance by over 2% on average across MMLU, BBH, and other benchmarks, challenging the prevailing assumption that more pre-training universally improves downstream results. Through controlled pre-training experiments, theoretical analysis of parameter sensitivity, and multi-benchmark evaluation, the study discovers, names, and attributes this phenomenon: extended pre-training systematically increases the sensitivity of pre-trained parameters to subsequent modification, which in turn impairs fine-tuning adaptability. The findings call for a shift in pre-training objective design that explicitly treats downstream adaptability as a core optimization criterion, rather than prioritizing pre-training loss minimization alone.

📝 Abstract
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.
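The abstract's core mechanism, increased "broad sensitivity of pre-trained parameters to modifications," can be probed empirically by measuring how much the loss degrades when parameters are perturbed by Gaussian noise of a fixed scale. The sketch below is a minimal illustrative proxy, not the paper's actual metric or models: the function name and the toy quadratic losses are hypothetical, standing in for checkpoints whose minima have different sharpness.

```python
import random

def perturbation_sensitivity(loss_fn, params, sigma=0.01, n_trials=200, seed=0):
    """Estimate sensitivity as the mean loss increase when every parameter
    is perturbed by Gaussian noise of scale sigma (a crude proxy for the
    paper's notion of sensitivity to parameter modifications)."""
    rng = random.Random(seed)
    base = loss_fn(params)
    total = 0.0
    for _ in range(n_trials):
        noisy = [p + rng.gauss(0.0, sigma) for p in params]
        total += loss_fn(noisy) - base
    return total / n_trials

# Toy stand-ins for two checkpoints: a sharper minimum (high curvature)
# versus a flatter one. Curvature controls how costly perturbations are.
sharp_loss = lambda w: 100.0 * sum(x * x for x in w)
flat_loss = lambda w: 1.0 * sum(x * x for x in w)

w0 = [0.0] * 10  # both toy losses are minimized at the origin
s_sharp = perturbation_sensitivity(sharp_loss, w0)
s_flat = perturbation_sensitivity(flat_loss, w0)
```

Under this proxy, the sharper minimum shows a larger expected loss increase for the same perturbation scale, mirroring the paper's claim that overtrained checkpoints sit in regions where modifications (fine-tuning included) are more damaging.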
Problem

Research questions and friction points this paper is trying to address.

Excessive pre-training harms fine-tuning performance
Catastrophic overtraining increases parameter sensitivity
Reassess pre-training for downstream adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Challenges the assumed benefits of extended pre-training
Identifies and names the catastrophic overtraining phenomenon
Analyzes the increase in parameter sensitivity with pre-training duration