Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models

📅 2024-08-13
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study investigates the synergy and trade-offs between pre-training and fine-tuning in large language models (LLMs). Methodologically, the authors fine-tune multiple intermediate pre-training checkpoints, systematically evaluating capability improvement, adaptation to new knowledge, retention of prior knowledge, and prompt robustness across 18 diverse datasets. Key findings: (1) continued pre-training improves the model in a latent way that surfaces only after fine-tuning; (2) fine-tuning yields substantial gains on tasks where the pre-trained model is weak, but induces forgetting of domain knowledge and of tasks not seen during fine-tuning; (3) fine-tuning increases prompt sensitivity, whereas additional pre-training mitigates this effect. Together, the results indicate that both forgetting and prompt sensitivity can be alleviated by further pre-training, suggesting that pre-training quality bounds what fine-tuning can achieve. The work offers empirical guidelines and evaluation protocols for the pre-training–fine-tuning pipeline.

📝 Abstract
The development of large language models has led to the formation of a pre-train-then-align paradigm, in which a model is typically pre-trained on a large text corpus and then undergoes a tuning stage to align it with human preferences or downstream tasks. In this work, we investigate the relationship between pre-training and fine-tuning by fine-tuning multiple intermediate pre-trained model checkpoints. Our results on 18 datasets suggest that i) continual pre-training improves the model in a latent way that is unveiled only after fine-tuning; ii) with extra fine-tuning, the datasets on which the model does not demonstrate capability gain much more than those on which the model already performs well after pre-training; iii) although the model benefits significantly from supervised fine-tuning, it may forget previously known domain knowledge and tasks not seen during fine-tuning; iv) the model exhibits high sensitivity to evaluation prompts after supervised fine-tuning, but this sensitivity can be alleviated by more pre-training.
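The evaluation protocol described above can be sketched as a simple loop: for each intermediate pre-training checkpoint, measure performance before and after fine-tuning and record the per-dataset change. The toy "skill" model, `evaluate()`, and `fine_tune()` below are hypothetical stand-ins for real training and evaluation code, used only to illustrate the shape of the analysis; they are not the authors' implementation.

```python
def evaluate(model, datasets):
    # Hypothetical scorer: look up a per-task "skill" value.
    return {d: model["skill"].get(d, 0.0) for d in datasets}

def fine_tune(model, train_task, gain=0.3, forget=0.05):
    # Hypothetical fine-tuning: boosts the trained task but slightly
    # degrades the others (the forgetting effect the paper measures).
    skill = dict(model["skill"])
    skill[train_task] = skill.get(train_task, 0.0) + gain
    for d in skill:
        if d != train_task:
            skill[d] -= forget
    return {"skill": skill}

# Two hypothetical pre-training checkpoints (training step -> model state).
checkpoints = {
    10_000: {"skill": {"taskA": 0.2, "taskB": 0.5}},
    50_000: {"skill": {"taskA": 0.3, "taskB": 0.6}},
}
datasets = ["taskA", "taskB"]

deltas = {}
for step, model in checkpoints.items():
    before = evaluate(model, datasets)
    after = evaluate(fine_tune(model, "taskA"), datasets)
    # Per-dataset change induced by fine-tuning at this checkpoint.
    deltas[step] = {d: after[d] - before[d] for d in datasets}
```

Comparing `deltas` across checkpoints is what lets the paper relate the amount of pre-training to both the gains on the fine-tuned task and the forgetting on held-out tasks.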
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Transfer Learning
Fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Pre-training and Fine-tuning Strategy
Knowledge Retention and Prompt-Sensitivity Reduction