🤖 AI Summary
Fine-tuning large language models (LLMs) can induce unintended out-of-distribution (OOD) generalization, and existing mitigations rely on augmenting the training data, which limits their applicability. This paper proposes Concept Ablation Fine-Tuning (CAFT), a paradigm that requires no modification of the training data. It uses interpretability tools to identify directions in the model's latent space corresponding to undesired concepts, then neutralizes those directions with linear projections during fine-tuning. Because the intervention operates directly in latent space, it needs no samples from the target distribution. Evaluated across three fine-tuning tasks, the method reduces misaligned responses by an order of magnitude while preserving in-distribution performance. The core contribution is a concept-ablation framework for controlling OOD generalization during LLM fine-tuning: a data-agnostic, interpretable, and intervention-based route to aligning LLM behavior.
📝 Abstract
Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
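The core operation the abstract describes, ablating a set of latent-space concept directions with a linear projection, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and in practice the projection would be applied to a model's intermediate activations (e.g. via forward hooks) throughout fine-tuning.

```python
import torch

def ablate_directions(hidden: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Project out concept directions from activations.

    hidden:     (..., d_model) activation tensor
    directions: (k, d_model) orthonormal unit vectors spanning the
                undesired concept subspace

    Implements h <- h - sum_i (h . v_i) v_i, i.e. projection onto the
    orthogonal complement of the concept subspace, so the ablated
    activations carry no component along any undesired direction.
    """
    coeffs = hidden @ directions.T       # (..., k): component along each direction
    return hidden - coeffs @ directions  # remove those components


# Usage sketch: ablate the first coordinate direction from an activation.
h = torch.tensor([[1.0, 2.0, 3.0]])
v = torch.tensor([[1.0, 0.0, 0.0]])
h_ablated = ablate_directions(h, v)  # component along v is removed
```

Note that the projection is applied during fine-tuning rather than only at inference time, so the optimizer cannot route gradient updates through the ablated concept subspace; this is what steers the learned generalization rather than merely masking it afterward.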