🤖 AI Summary
To mitigate catastrophic forgetting caused by parameter drift during large language model (LLM) fine-tuning, in the restrictive setting where only the model weights are available (no access to the original training data or recipe), this work proposes a context-free synthetic data generation method. Sampling unconditionally and autoregressively from the original model yields an approximately unbiased estimate of the KL divergence between the original and fine-tuned models, which is then used to regularize the shift in the output distribution. The work introduces this "context-free generation" mechanism and shows that it mitigates forgetting more effectively than context-dependent synthesis or replaying a subset of the pretraining data. It also characterizes how key design choices, such as sampling temperature and the synthetic-to-real data ratio, affect forgetting. Experiments on OLMo-1B and R1-Distill-Llama-8B show that the method largely preserves zero-shot generalization and chain-of-thought reasoning while achieving substantially lower forgetting than all baselines.
📝 Abstract
Fine-tuning a language model often degrades its existing performance on other tasks, due to a shift in the model parameters; this phenomenon is often referred to as (catastrophic) forgetting. We are interested in mitigating this in settings where we only have access to the model weights but not to its training data or recipe. A natural approach is to penalize the KL divergence between the original model and the new one. Our main realization is that a simple process, which we term context-free generation, allows for an approximately unbiased estimate of this KL divergence. We show that augmenting a fine-tuning dataset with context-free generations mitigates forgetting in two settings: (a) preserving the zero-shot performance of pretrained-only models, and (b) preserving the reasoning performance of thinking models. We show that contextual synthetic data, and even a portion of the pretraining data, are less effective. We also investigate the effect of design choices such as the generation temperature and data ratios. We present results for OLMo-1B in the pretrained-only setting and for R1-Distill-Llama-8B in the reasoning setting.
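The key claim, that sampling from the original model with no prompt gives an approximately unbiased estimate of the KL divergence between the old and new models, can be illustrated with a toy autoregressive model. The sketch below is not from the paper; the bigram models, function names, and constants are illustrative assumptions. It checks the Monte Carlo log-ratio estimate from unconditional samples against the exact sequence-level KL:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BOS, HORIZON = 4, 0, 5  # tiny vocabulary, start token, sequence length

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy autoregressive (bigram) models: row i holds the next-token
# probabilities conditioned on previous token i.
W_old = rng.normal(size=(VOCAB, VOCAB))
W_new = W_old + 0.5 * rng.normal(size=(VOCAB, VOCAB))  # parameter drift from fine-tuning
P_old, P_new = softmax(W_old), softmax(W_new)

def context_free_kl_estimate(n_samples):
    """Monte Carlo estimate of the sequence-level KL(old || new):
    sample unconditionally from the old model (starting at BOS, no
    prompt) and average the per-sequence log-probability ratio."""
    total = 0.0
    for _ in range(n_samples):
        prev, logratio = BOS, 0.0
        for _ in range(HORIZON):
            tok = rng.choice(VOCAB, p=P_old[prev])
            logratio += np.log(P_old[prev, tok]) - np.log(P_new[prev, tok])
            prev = tok
        total += logratio
    return total / n_samples

def exact_kl():
    """Exact sequence-level KL for the bigram chain, by propagating the
    old model's marginal distribution over the previous token."""
    marg = np.zeros(VOCAB)
    marg[BOS] = 1.0
    kl = 0.0
    per_row_kl = (P_old * (np.log(P_old) - np.log(P_new))).sum(axis=1)
    for _ in range(HORIZON):
        kl += marg @ per_row_kl
        marg = marg @ P_old  # marginal of the next token under the old model
    return kl
```

Note that the estimator is unbiased only when sampling at temperature 1; sampling at a different temperature changes the sampling distribution away from the old model's, which is one reason the generation temperature appears above as a design choice. In the paper's actual setting the unconditional samples are not used to compute a KL penalty directly but are kept as synthetic data and mixed into the fine-tuning set.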