Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) are prone to generating harmful content during inference, and existing test-time detoxification methods suffer from coarse-grained, unstable interventions because they fail to model the fine-grained transitions from toxic to benign outputs. This paper proposes an autoregressive reward-guided representation editing framework that explicitly models the toxicity state-transition process in the latent space, converting sparse toxicity annotations into dense, token-level autoregressive reward signals via interpolation. A two-stage editing mechanism, combining directional semantic guidance with lightweight gradient-based optimization, enables both forward steering and differentiable dynamic adjustment. Evaluated on eight mainstream LLMs, the method reduces toxicity by 62.21% relative to state-of-the-art approaches while cutting inference latency by 47.58%, with negligible degradation of the original model's capabilities.
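The interpolation idea described above, turning sparse toxic/non-toxic labels into dense per-step reward signals along a latent trajectory, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the 2-D vectors, the linear interpolation schedule, and assigning each point a reward equal to its interpolation coefficient are all assumptions made for clarity.

```python
import numpy as np

def interpolate_rewards(h_toxic, h_nontoxic, num_steps):
    """Linearly interpolate between a toxic and a non-toxic latent
    representation. Each intermediate point receives a dense reward
    equal to its interpolation coefficient (0 = fully toxic,
    1 = fully non-toxic), densifying two sparse endpoint labels."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    # Each row is one point on the toxic -> non-toxic trajectory.
    trajectory = np.stack([(1 - a) * h_toxic + a * h_nontoxic for a in alphas])
    return trajectory, alphas

# Hypothetical 2-D latent states standing in for real hidden vectors.
h_tox = np.array([1.0, 0.0])
h_non = np.array([0.0, 1.0])
traj, rewards = interpolate_rewards(h_tox, h_non, 5)
```

In the paper's setting, such (representation, reward) pairs along the trajectory would serve as dense training targets for the autoregressive reward model, in place of the two sparse endpoint annotations alone.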

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content, necessitating detoxification strategies to ensure safe and responsible deployment. Test-time detoxification methods, which typically introduce static or dynamic interventions into LLM representations, offer a promising solution due to their flexibility and minimal invasiveness. However, current approaches often suffer from imprecise interventions, primarily due to their insufficient exploration of the transition space between toxic and non-toxic outputs. To address this challenge, we propose Autoregressive Reward Guided Representation Editing (ARGRE), a novel test-time detoxification framework that explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing. ARGRE identifies non-toxic semantic directions and interpolates between toxic and non-toxic representations to reveal fine-grained transition trajectories. These trajectories transform sparse toxicity annotations into dense training signals, enabling the construction of an autoregressive reward model that delivers stable and precise editing guidance. At inference, the reward model guides an adaptive two-step editing process to obtain detoxified representations: it first performs directional steering based on expected reward gaps to shift representations toward non-toxic regions, followed by lightweight gradient-based refinements. Extensive experiments across 8 widely used LLMs show that ARGRE significantly outperforms leading baselines in effectiveness (-62.21% toxicity) and efficiency (-47.58% inference time), while preserving the core capabilities of the original model with minimal degradation. Our code is available at the website.
Problem

Research questions and friction points this paper is trying to address.

Detoxifying toxic content generated by Large Language Models
Addressing imprecise interventions in test-time detoxification methods
Modeling toxicity transitions for stable reward-guided editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive reward model guides representation editing
Interpolates toxic and non-toxic representations for transitions
Adaptive two-step editing with directional steering and refinements
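The adaptive two-step editing from the bullets above can be sketched as a minimal toy: first steer the representation along a non-toxic direction scaled by the expected reward gap, then refine it with a few gradient steps on the reward. Everything concrete here is an assumption for illustration: the paper's reward model is a learned autoregressive network, whereas a linear proxy (with a closed-form gradient) stands in below, and the direction, learning rate, and step count are hypothetical.

```python
import numpy as np

# Hypothetical toy reward: higher means less toxic. A linear proxy
# stands in for ARGRE's learned autoregressive reward model.
w = np.array([0.0, 1.0])

def reward(h):
    return float(w @ h)

def reward_grad(h):
    # Gradient of the linear proxy reward w.r.t. the representation.
    return w

def edit_representation(h, direction, target_reward, lr=0.1, refine_steps=3):
    # Step 1: directional steering, scaled by the expected reward gap
    # between the target and the current representation's reward.
    gap = max(target_reward - reward(h), 0.0)
    h = h + gap * direction
    # Step 2: lightweight gradient-based refinement toward higher reward.
    for _ in range(refine_steps):
        h = h + lr * reward_grad(h)
    return h

h0 = np.array([1.0, 0.0])   # "toxic" latent state (assumed)
d = np.array([-0.5, 0.5])   # assumed non-toxic semantic direction
h1 = edit_representation(h0, d, target_reward=1.0)
```

The split mirrors the design choice the bullets describe: the coarse directional move does most of the work cheaply, while the short gradient refinement makes the adjustment differentiable and fine-grained.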