Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether fine-tuned large language models (LLMs) acquire the Knobe effect, a human cognitive bias in which moral evaluation influences judgments of intentionality, and traces where the bias arises inside the models. Using layer-patching, a mechanistic interpretability technique, the authors systematically localize the bias across three open-weight LLMs and find that it emerges robustly in a small set of intermediate layers (e.g., MLP blocks) with sparse, critical activation patterns. Notably, patching activations from the corresponding pretrained model into just 2–3 of these critical layers of the finetuned model is sufficient to substantially suppress the bias, without any fine-tuning or retraining. The analysis further confirms that the bias is introduced during supervised fine-tuning rather than being inherent to pretraining. The resulting intervention is precise and low-overhead, preserves other model capabilities, and points toward a practical paradigm for targeted, controllable alignment.

📝 Abstract
Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across three open-weight LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
Problem

Research questions and friction points this paper is trying to address.

Investigating moral bias manifestation in finetuned LLMs
Localizing Knobe effect bias in specific model layers
Developing targeted interventions to mitigate bias without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-Patching analysis across multiple open-weights LLMs
Localizing moral bias in specific layers of finetuned models
Patching pretrained activations to eliminate bias without retraining
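The patching intervention described above can be sketched with a toy example. This is an illustrative stand-in, not the paper's implementation: each "model" is just a stack of layer functions, and we cache the pretrained stack's per-layer outputs, then overwrite the finetuned stack's activations at a chosen set of layers with those cached values.

```python
# Toy sketch of layer patching: substitute selected layer outputs of a
# "finetuned" model with cached activations from the "pretrained" model.
# All models, layers, and values here are hypothetical stand-ins.

def run_with_cache(layers, x):
    """Run a stack of layer functions, caching each layer's output."""
    cache = []
    for layer in layers:
        x = layer(x)
        cache.append(x)
    return x, cache

def run_patched(layers, x, patch_cache, patch_at):
    """Run a stack, but at each index in `patch_at`, replace this model's
    activation with the cached activation from the other model."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in patch_at:
            x = patch_cache[i]  # overwrite with pretrained activation
    return x

# Hypothetical 4-layer "models": same architecture, one diverged layer.
pretrained = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 3, lambda x: x * 1]
finetuned = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 10, lambda x: x * 1]

_, pre_cache = run_with_cache(pretrained, 0)

biased = run_patched(finetuned, 0, pre_cache, patch_at=set())  # no patching
patched = run_patched(finetuned, 0, pre_cache, patch_at={2})   # patch layer 2
clean, _ = run_with_cache(pretrained, 0)

print(biased, patched, clean)  # patching layer 2 recovers the clean output
```

In a real transformer this would be done with forward hooks on specific MLP blocks, swapping hidden states between the pretrained and finetuned checkpoints at matching layer indices.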