IF-Guide: Influence Function-Guided Detoxification of LLMs

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of both explicit and implicit toxicity in large language models (LLMs) induced by training data. We propose a proactive, token-level detoxification method that adapts influence functions for human-preference-free toxicity attribution, allowing precise identification and suppression of toxic tokens. Our approach introduces an unsupervised toxicity learning objective, a toxicity-aware document selection strategy, and a lightweight proxy model (~1M parameters), significantly enhancing scalability. The method is effective across both the pre-training and fine-tuning stages: it reduces explicit toxicity by up to 10× compared to unfiltered baselines and implicit toxicity by 3× relative to state-of-the-art methods such as DPO and RAD, while substantially lowering computational overhead. To our knowledge, this is the first work to achieve proactive, token-level, attribution-driven detoxification, establishing a novel paradigm for efficient and scalable safe LLM training.

📝 Abstract
We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts *reactive* approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a *proactive* approach, IF-Guide, which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-Guide does not rely on human-preference data, which is typically required by existing alignment methods. In evaluation, we demonstrate that IF-Guide substantially reduces both explicit and implicit toxicity, by up to 10× compared to uncensored models and up to 3× compared to baseline alignment methods, e.g., DPO and RAD, across both pre-training and fine-tuning scenarios. IF-Guide is computationally efficient: a billion-parameter model is *not necessary* for computing influence scores; a million-parameter model, with 7.5× fewer parameters, can effectively serve as a proxy for identifying harmful data.
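To give a flavor of the kind of token-level attribution the abstract describes, here is a minimal toy sketch, in the gradient-similarity (TracIn-style) spirit, of scoring training tokens by how well their loss gradients align with the gradient of a toxicity objective on a tiny linear proxy model. All names, the model, and the scoring rule are illustrative assumptions, not the paper's actual IF-Guide formulation.

```python
import numpy as np

# Hypothetical sketch: per-token influence on "toxicity" approximated as the
# dot product between a training token's loss gradient and the accumulated
# gradient of a toxicity query set, both computed on a small proxy model.
rng = np.random.default_rng(0)

VOCAB, DIM = 50, 8
W = rng.normal(scale=0.1, size=(VOCAB, DIM))  # toy proxy model: linear logits


def token_grad(token_id, context_vec):
    """Gradient of cross-entropy loss for one token w.r.t. W (flattened)."""
    logits = W @ context_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    g = np.outer(probs, context_vec)   # softmax term
    g[token_id] -= context_vec         # subtract one-hot target term
    return g.ravel()


def influence_on_toxicity(train_tokens, train_ctx, toxic_tokens, toxic_ctx):
    """Score each training token by gradient alignment with toxic examples."""
    g_tox = sum(token_grad(t, c) for t, c in zip(toxic_tokens, toxic_ctx))
    return [float(token_grad(t, c) @ g_tox)
            for t, c in zip(train_tokens, train_ctx)]


# A training token matching the toxicity query (same context, for
# comparability) should score higher than an unrelated token.
ctx = rng.normal(size=DIM)
scores = influence_on_toxicity([3, 17], [ctx, ctx], [3], [ctx])
print(scores[0] > scores[1])
```

In an actual pipeline, tokens with high scores would then be down-weighted or suppressed by the training objective; the paper's method additionally selects which documents to score, which this sketch omits.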
Problem

Research questions and friction points this paper is trying to address.

Identifying harmful tokens in training data to reduce toxicity
Proposing a proactive approach using influence functions for detoxification
Reducing toxicity without relying on human-preference data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses influence functions to identify harmful tokens
Proactive training data detoxification without human preference
Efficient proxy model for computing influence scores