UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation

📅 2025-04-29
📈 Citations: 2
Influential: 0
🤖 AI Summary
This study addresses the lack of universality in existing large language model (LLM) detoxification methods by proposing a unified detoxification framework that requires no model-specific hyperparameter tuning. Methodologically, it uses contrastive decoding to distill detoxifying representations into high-quality synthetic text data, then fine-tunes target models on the distilled text, achieving cross-architecture generalization across GPT-2, OPT, Falcon, and LLaMA-2. Key contributions include: (1) the first general-purpose dataset distillation paradigm tailored to detoxification; (2) the first demonstration that a single hyperparameter configuration is effective across multiple model families and architectures; and (3) the discovery of an intrinsic connection between detoxification and political bias mitigation. Experiments show an average 38.7% reduction in toxicity across multiple benchmarks, with negligible impact on language modeling capability (perplexity increases by only 1.2%). Notably, detoxifying data distilled from GPT-2 transfers effectively to larger models such as LLaMA-2.

📝 Abstract
We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.
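The core distillation step described above samples text via contrastive decoding: token logits from a base model are contrasted against those of a toxicity-conditioned variant, so tokens the toxic model favors are suppressed. The following is a minimal numpy sketch of this idea, not the paper's implementation; the function names, toy logits, and the scaling factor `alpha` are illustrative assumptions.

```python
import numpy as np

def contrastive_logits(base_logits, toxic_logits, alpha=0.5):
    """Contrastive decoding sketch (illustrative, not UniDetox's exact form):
    boost tokens the base model prefers, penalize tokens whose score rises
    under the toxicity-conditioned model."""
    base = np.asarray(base_logits, dtype=float)
    toxic = np.asarray(toxic_logits, dtype=float)
    return (1.0 + alpha) * base - alpha * toxic

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# Toy 3-token vocabulary: token 2 is strongly favored by the toxic model.
base = np.array([2.0, 1.0, 1.5])
toxic = np.array([1.0, 1.0, 3.0])
p = softmax(contrastive_logits(base, toxic, alpha=0.5))
# Token 2's probability drops relative to the plain base distribution,
# steering sampled "detoxifying" text away from toxic continuations.
```

In the paper's pipeline, text sampled from such a contrastive distribution forms the synthetic distilled dataset, and any target LLM is then fine-tuned on that text.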
Problem

Research questions and friction points this paper is trying to address.

Universal detoxification method for diverse large language models
Eliminates need for model-specific tuning in detoxification
Reduces both toxicity and political bias via distilled text data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal detoxification via dataset distillation
Contrastive decoding for synthetic text data
Single hyperparameter configuration across models