Probe-based Fine-tuning for Reducing Toxicity

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the failure of probes in probe-guided fine-tuning due to Goodhart's Law: when probes are optimized as training objectives, they lose reliability. We propose an alignment method that jointly suppresses toxicity and preserves probe fidelity. The core contribution is a probe-guided training recipe built on supervised fine-tuning (SFT) and direct preference optimization (DPO): DPO substantially outperforms SFT, achieving both effective toxicity reduction and high probe accuracy in detecting harmful representations. Crucially, only lightweight probe retraining after fine-tuning is required to restore >95% of the original probe accuracy, eliminating the need for complex probe ensembling. Experiments across multiple toxicity benchmarks demonstrate significant toxicity reduction while maintaining robust internal monitoring, validating the feasibility and practicality of probe-guided alignment.
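The DPO objective mentioned above can be sketched as a per-pair loss. This is a minimal illustration of the standard DPO formulation, not the paper's exact implementation; in particular, the idea that the "chosen" response is the one a toxicity probe scores as less toxic is an assumption of this sketch.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    In a probe-guided setup (assumed here, not taken from the paper),
    the 'chosen' response would be the one the toxicity probe rates as
    less toxic, so the preference signal comes from internal
    representations rather than output-level labels.
    """
    # Margin between the policy's and reference model's log-prob gaps.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin: small when the policy
    # prefers the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At a zero margin the loss is log 2, and it shrinks as the policy's preference for the probe-approved response grows relative to the reference model.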

📝 Abstract
Probes trained on model activations can detect undesirable behaviors, such as deception or bias, that are difficult to identify from outputs alone. This makes them useful detectors of misbehavior. They are also valuable training signals, since they reward not only outputs but also the internal processes used to arrive at them. However, training against interpretability tools raises a fundamental concern: when a monitor becomes a training target, it may cease to be reliable (Goodhart's Law). We propose two methods for training against probes, based on Supervised Fine-tuning and Direct Preference Optimization. We conduct an initial exploration of these methods in a testbed for reducing toxicity and evaluate how much probe accuracy drops when training against them. To retain the accuracy of probe detectors after training, we attempt (1) training against an ensemble of probes, (2) retaining held-out probes that are not used for training, and (3) retraining new probes after training. We find, first, that probe-based preference optimization unexpectedly preserves probe detectability better than classifier-based methods, suggesting the preference-learning objective incentivizes maintaining rather than obfuscating the relevant representations. Second, probe diversity provides minimal practical benefit: simply retraining probes after optimization recovers high detection accuracy. Our findings suggest probe-based training can be viable for certain alignment methods, though probe ensembles are largely unnecessary when retraining is feasible.
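The probes described in the abstract are typically linear classifiers fit on frozen hidden states, and "retraining probes after optimization" amounts to refitting such a classifier on the fine-tuned model's activations. The sketch below shows a generic logistic-regression probe; the function names (`train_probe`, `probe_accuracy`), hyperparameters, and the 0/1 toxicity labeling are illustrative assumptions, not the paper's recipe.

```python
import numpy as np

def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a linear probe (logistic regression) on frozen activations.

    acts:   (n, d) array of hidden-state vectors from the model.
    labels: (n,) array of 0/1 labels (e.g. non-toxic vs toxic),
            an assumed labeling scheme for this sketch.
    """
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # sigmoid of logits
        grad = p - labels                           # dLoss/dLogits for log-loss
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples the probe classifies correctly."""
    preds = (acts @ w + b) > 0
    return (preds == labels).mean()
```

Because the probe is this cheap to fit, refitting it on the fine-tuned model's activations (the paper's "retraining" strategy) is far lighter-weight than maintaining an ensemble of diverse probes during training.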
Problem

Research questions and friction points this paper is trying to address.

Reducing model toxicity using probe-based fine-tuning methods
Addressing reliability loss when probes become training targets
Evaluating probe accuracy retention through ensemble and retraining techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probe-based fine-tuning reduces model toxicity
Training against probe ensembles maintains detection accuracy
Retraining probes after optimization recovers high accuracy