🤖 AI Summary
This work addresses the concept misattribution problem in crosscoder-based model diffing, where the L1 training loss erroneously attributes concepts already present in the base model to the fine-tuned model. To improve the causal validity and interpretability of concept attribution, the authors propose Latent Scaling, a diagnostic that measures each latent's presence in both models, and replace the conventional L1 regularization with a BatchTopK sparsity constraint. This enables more precise isolation of fine-tuning-induced behavioral changes. Applied to Gemma 2 2B base and chat models, the approach identifies genuinely chat-specific, highly interpretable concepts, including "false information," "personal question," and fine-grained refusal-triggering patterns. Experiments demonstrate substantial improvements over the standard L1 crosscoder in extracting behaviors introduced by fine-tuning. The framework establishes a reproducible, interpretable methodology for behavioral attribution in large language models, advancing both mechanistic interpretability and safety-aware model analysis.
📝 Abstract
Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoder's L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of genuinely chat-specific latents that are both interpretable and causally effective, representing concepts such as *false information* and *personal question*, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat tuning modifies language model behavior.
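The two techniques named in the abstract can be sketched concretely. The NumPy snippet below is a minimal illustration, not the paper's released code: the function names, array shapes, and the closed-form least-squares estimator are our assumptions. `latent_scaling_beta` fits, for a single latent, the scalar that best rescales that latent's reconstruction to one model's activations, so comparing the base-model and chat-model scales flags latents misattributed as chat-only; `batch_topk` keeps only the `k * B` largest latent pre-activations across a batch of `B` examples, in place of an L1 sparsity penalty.

```python
import numpy as np

def latent_scaling_beta(f, d, acts):
    """Least-squares scale of one latent against one model's activations.

    Fits beta minimizing  sum_x || beta * f(x) * d - a(x) ||^2,
    which has the closed form
        beta = sum_x f(x) <d, a(x)>  /  ( sum_x f(x)^2 * ||d||^2 ).

    f:    (N,)   latent activations f_j(x) over a batch
    d:    (D,)   decoder direction d_j for that latent
    acts: (N, D) model activations a(x) the crosscoder reconstructs
    """
    num = np.sum(f * (acts @ d))
    den = np.sum(f ** 2) * (d @ d)
    return num / den

def batch_topk(pre_acts, k):
    """BatchTopK sparsity: keep the k*B largest pre-activations across
    the whole batch of B examples, zero the rest (no L1 penalty).

    pre_acts: (B, M) non-negative latent pre-activations
    k:        average number of active latents per example
    Ties at the threshold are all kept in this sketch.
    """
    n_keep = k * pre_acts.shape[0]
    # Threshold at the (k*B)-th largest entry of the flattened batch.
    thresh = np.partition(pre_acts.ravel(), -n_keep)[-n_keep]
    return np.where(pre_acts >= thresh, pre_acts, 0.0)
```

Under this framing, a latent whose base-model beta is comparable to its chat-model beta is materially present in both models even if its base decoder direction is near zero, which is exactly the misattribution the diagnostic is meant to catch.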