🤖 AI Summary
The prevailing hypothesis that DPO reduces language model toxicity by merely suppressing a small set of “toxic neurons” is overly simplistic and lacks mechanistic grounding.
Method: We employ activation patching, toxicity probe projection, hierarchical clustering, and causal attribution analysis to dissect DPO’s internal mechanisms across model layers.
Contribution/Results: We reveal that DPO induces distributed, progressive activation shifts across numerous MLP neurons in multiple layers—not localized suppression. We identify four functionally distinct neuron groups—two that reduce toxicity and two that promote anti-toxicity—whose cumulative activation shifts account for 95.1% of the observed toxicity reduction; the conventionally cited "toxic neuron suppression" contributes only 4.9%. Patching the activations of all four groups fully reproduces DPO's toxicity-mitigation effect. This work establishes DPO as a distributed, multi-stage regulatory process, challenging localized attribution models and providing a new interpretability framework for alignment.
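Conceptually, the group-wise attribution described above can be sketched as a toy activation-patching loop: swap the DPO model's activations into the base model for one candidate neuron group, re-project onto the toxicity probe, and measure how much of the full toxicity drop that group alone explains. Everything below (probe weights, activations, group indices) is illustrative, not taken from the paper, and real models would of course use hooked tensors rather than plain lists.

```python
# Hypothetical sketch of activation patching between a base model and its
# DPO-tuned counterpart; activations are plain lists so the logic is
# self-contained.

def patch_activations(base_acts, dpo_acts, neuron_group):
    """Copy DPO activations into the base activations for the selected
    neuron indices, leaving all other neurons untouched."""
    patched = list(base_acts)
    for idx in neuron_group:
        patched[idx] = dpo_acts[idx]
    return patched

def toxicity_score(acts, probe_direction):
    """Project MLP activations onto a learned toxicity probe direction;
    larger values indicate output closer to the toxic direction."""
    return sum(a * p for a, p in zip(acts, probe_direction))

# Toy numbers (not from the paper): 6 neurons, a toxicity probe, and
# activations before/after DPO fine-tuning.
probe = [1.0, 0.5, -0.2, 0.0, 0.3, 0.1]
base = [2.0, 1.0, 0.5, 0.2, 1.5, 0.8]
dpo = [0.5, 0.2, 0.5, 0.2, 0.1, 0.0]

full_reduction = toxicity_score(base, probe) - toxicity_score(dpo, probe)

# Patch only one hypothetical neuron group and measure how much of the
# full reduction it explains, as the paper does per group.
group = [0, 1]
patched = patch_activations(base, dpo, group)
explained = toxicity_score(base, probe) - toxicity_score(patched, probe)
print(f"share of reduction explained by this group: {explained / full_reduction:.1%}")
```

Running the same loop over all four identified groups, and then over their union, is what lets the paper say the groups' effects are cumulative and jointly reproduce DPO's behavior.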
📝 Abstract
Safety fine-tuning algorithms are widely used to reduce harmful outputs in language models, but how they achieve this remains unclear. We study the Direct Preference Optimization (DPO) algorithm for toxicity reduction, where current explanations claim that DPO works by dampening the activations of toxic MLP neurons. However, through activation patching, we show that this explanation is incomplete. Projections onto a toxicity probe's direction show that only 4.9% of the toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity through distributed activation shifts across a majority of neurons, progressively moving MLP layer outputs away from toxicity. These shifts accumulate across four neuron groups: two reducing toxicity and two promoting anti-toxicity. Activation patching validates the cumulative roles of these groups: patching all identified groups effectively replicates DPO's effects. These findings illustrate DPO's mechanism: it reduces toxicity by accumulating small activation shifts across many neurons throughout the layers. Our results provide new mechanistic insights into how safety fine-tuning reduces harmful outputs in language models.
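A useful detail behind the 95.1% / 4.9% split: because projection onto a probe direction is linear in the activations, the total drop in probe-projected toxicity decomposes exactly into per-neuron contributions, and group shares are just sums over index sets. A minimal sketch with made-up numbers (nothing here is from the paper):

```python
# Hypothetical per-neuron decomposition of the toxicity reduction.
# The probe score is a dot product, so the change in score equals the sum
# over neurons of (activation shift) * (probe weight).

def per_neuron_contrib(base_acts, dpo_acts, probe_direction):
    """Contribution of each neuron to the drop in probe-projected toxicity."""
    return [(b - d) * p for b, d, p in zip(base_acts, dpo_acts, probe_direction)]

probe = [1.0, 0.5, -0.2, 0.0, 0.3, 0.1]  # toy probe direction
base = [2.0, 1.0, 0.5, 0.2, 1.5, 0.8]    # toy base-model activations
dpo = [0.5, 0.2, 0.5, 0.2, 0.1, 0.0]     # toy DPO-model activations

contribs = per_neuron_contrib(base, dpo, probe)
total = sum(contribs)

# Share of the reduction attributable to one hypothetical neuron group:
group = [4, 5]
share = sum(contribs[i] for i in group) / total
print(f"group share of total reduction: {share:.1%}")
```

Summing such shares over the paper's four groups versus over the classically "suppressed" toxic neurons is how the distributed-shift and suppression accounts can be compared on the same scale.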