Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

📅 2025-10-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Narrow-domain finetuning induces strong, interpretable, domain-specific biases in the activations of large language models (LLMs), enabling reconstruction of the format and general content of the finetuning data and posing concrete privacy and security risks. Method: the authors compare pre- vs. post-finetuning activations on the first few tokens of random text, steer the model by adding this activation difference back into its residual stream, and build an LLM-based interpretability agent to test how much the bias reveals about the finetuning domain. Contribution/Results: the phenomenon is validated across architectures (Gemma, LLaMA, Qwen) and scales (1B–32B parameters); an agent with access to the bias identifies the finetuning domain significantly better than prompting-only baselines, and mixing pretraining data into the finetuning corpus largely removes the biases. The work challenges the common practice of using narrowly finetuned models as proxies for studying broader finetuning in AI safety and interpretability research.

📝 Abstract
Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
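The model-diffing idea in the abstract — averaging activation differences between the base and finetuned model over the first few tokens of random text, then adding that difference back as a steering vector — can be sketched as follows. This is a minimal illustration with toy numpy arrays standing in for real hidden states; the function names, shapes, and the `alpha` scaling parameter are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def activation_difference(base_acts, ft_acts, n_tokens=5):
    """Mean activation difference over the first n token positions.

    base_acts, ft_acts: arrays of shape (n_prompts, seq_len, d_model),
    hidden states of the base and finetuned model on the same random text.
    Returns a single direction of shape (d_model,).
    """
    diff = ft_acts[:, :n_tokens, :] - base_acts[:, :n_tokens, :]
    return diff.mean(axis=(0, 1))

def steer(acts, direction, alpha=1.0):
    """Add the difference direction to activations (simple additive steering)."""
    return acts + alpha * direction

# Toy demonstration: pretend finetuning shifted every activation by +0.5.
rng = np.random.default_rng(0)
d_model = 8
base = rng.normal(size=(4, 10, d_model))       # 4 random prompts, 10 tokens
finetuned = base + 0.5
direction = activation_difference(base, finetuned, n_tokens=3)
steered = steer(base, direction)               # recovers the finetuned shift
```

In practice `base_acts` and `ft_acts` would come from hooking a chosen layer of the two models; the paper's finding is that generating with such steered activations produces text resembling the finetuning data's format and content.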
Problem

Research questions and friction points this paper is trying to address.

Detecting traces of narrow finetuning in LLM activations via model diffing
Interpreting activation biases to recover the finetuning domain and content
Assessing whether overfit narrowly finetuned models are realistic proxies for safety and interpretability research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing activation differences to interpret finetuning domain
Steering model activations to reveal finetuning data patterns
Using model diffing to detect narrow finetuning biases