Detoxifying LLMs via Representation Erasure-Based Preference Optimization

πŸ“… 2026-02-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the persistent vulnerability of large language models to adversarial prompts and relearning attacks even after detoxification: existing methods struggle to fully eliminate toxic directions in internal representations. The authors reformulate detoxification as a token-level preference optimization problem and propose Representation Erasure-based Preference Optimization (REPO), a novel method that uses a custom objective to steer the internal representations of toxic outputs toward those of their harmless counterparts. This approach enables fine-grained, localized neuron editing at the representation level, and is the first such technique to achieve precise toxicity erasure without compromising general model capabilities. REPO demonstrates superior robustness against both relearning attacks and enhanced GCG jailbreaking attacks, outperforming current detoxification strategies that operate at either the output or the representation level.
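The summary's core mechanism, a token-level preference term combined with a term that pulls the hidden states of toxic tokens toward those of their benign counterparts, can be sketched as below. This is a minimal illustration, not the paper's actual objective: the function name, the DPO-style log-sigmoid preference term, and the squared-distance erasure term are all assumptions made for the sketch.

```python
import numpy as np

def repo_style_loss(h_toxic, h_benign, logp_toxic, logp_benign,
                    beta=0.1, lam=1.0):
    """Hypothetical REPO-like objective (illustrative, not the paper's).

    h_toxic, h_benign : (tokens, dim) hidden states of paired continuations
    logp_toxic, logp_benign : (tokens,) per-token log-probabilities
    """
    # DPO-style token-level preference term: prefer benign continuations
    # by maximizing the log-sigmoid of the scaled log-prob margin.
    margin = beta * (logp_benign - logp_toxic)
    pref = -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))
    # Representation-erasure term: pull each toxic token's hidden state
    # toward its benign counterpart (mean squared distance).
    erase = np.mean((h_toxic - h_benign) ** 2)
    return pref + lam * erase
```

When the paired representations coincide and the log-probabilities are equal, the erasure term vanishes and the preference term reduces to log 2, the usual log-sigmoid value at zero margin.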

Technology Category

Application Category

πŸ“ Abstract
Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.
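The abstract's probing claim, that harmful "directions" remain linearly decodable from hidden states, is easy to illustrate on synthetic data: a simple logistic-regression probe recovers a planted "toxic" direction from fake activations. Everything below (dimension, shift magnitude, training setup) is an illustrative assumption, not data or a procedure from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Planted unit-norm "toxic" direction in activation space.
toxic_dir = rng.normal(size=d)
toxic_dir /= np.linalg.norm(toxic_dir)

# Synthetic hidden states: toxic samples are shifted +2 along the
# direction, benign samples -2, over a shared Gaussian base.
base = rng.normal(size=(200, d))
X = np.vstack([base + 2.0 * toxic_dir, base - 2.0 * toxic_dir])
y = np.array([1] * 200 + [0] * 200)

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y                      # gradient of the logistic loss
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

acc = ((X @ w + b > 0) == (y == 1)).mean()
cos = float(np.dot(w / np.linalg.norm(w), toxic_dir))
```

On this toy data the probe separates the classes with high accuracy, and its weight vector aligns closely with the planted direction, which is the kind of evidence the abstract refers to when it says superficial edits leave harmful directions behind.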
Problem

Research questions and friction points this paper is trying to address.

toxicity
large language models
adversarial prompting
relearning attacks
representation erasure
Innovation

Methods, ideas, or system contributions that make the work stand out.

representation erasure
preference optimization
toxicity mitigation
robust detoxification
token-level alignment