Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM toxicity evaluation relies on single-label benchmarks, failing to capture the ambiguity and multidimensional nature of real-world prompts—leading to frequent false negatives and false positives—while fine-grained multi-label human annotation remains prohibitively expensive. Method: We propose a novel multi-label paradigm for toxicity detection, introducing three large-scale, expert-validated multi-label benchmarks—Q-A-MLL, R-A-MLL, and H-X-MLL—covering 15 fine-grained toxicity categories. We theoretically prove the superiority of pseudo-labeling over single-label supervision and leverage public data for cost-effective, high-quality labeling. Contribution/Results: Experiments demonstrate that our approach significantly outperforms strong baselines—including GPT-4o and DeepSeek—in multi-label toxicity classification accuracy and robustness. Our framework establishes a more realistic, scalable, and technically grounded evaluation infrastructure for LLM safety assessment.

📝 Abstract
Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
Problem

Research questions and friction points this paper is trying to address.

Current toxicity detectors fail to capture multi-dimensional toxic content
Single-label benchmarks cause biased evaluations with missed and false detections
Gathering comprehensive multi-label toxicity annotations is prohibitively expensive
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces three multi-label toxicity detection benchmarks
Provides theoretical proof for pseudo-label training superiority
Develops pseudo-label-based method outperforming advanced baselines
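The contrast between single-label supervision and multi-label pseudo-labeling can be sketched in a few lines. This is an illustrative toy, not the paper's actual method: the category names are a subset of a generic taxonomy (the paper uses 15 fine-grained categories not listed here), and the fixed 0.5 threshold on teacher scores is an assumption for demonstration.

```python
# Illustrative sketch: a single-label view collapses a toxic prompt to its
# top category, while multi-label pseudo-labels keep every category a
# teacher model scores above a threshold. Categories and threshold are
# hypothetical, not the paper's taxonomy or procedure.
TOXICITY_CATEGORIES = ["hate", "harassment", "self-harm", "violence", "sexual"]

def single_label(probs):
    """Single-label supervision: keep only the highest-scoring category."""
    return [max(probs, key=probs.get)]

def multi_label_pseudo(probs, threshold=0.5):
    """Multi-label pseudo-labels: keep all categories above the threshold."""
    return [c for c, p in probs.items() if p >= threshold]

# Example teacher scores for one prompt that is simultaneously hateful,
# harassing, and violent: argmax discards two of the three dimensions.
teacher_scores = {"hate": 0.82, "harassment": 0.71, "self-harm": 0.03,
                  "violence": 0.55, "sexual": 0.02}
print(single_label(teacher_scores))        # -> ['hate']
print(multi_label_pseudo(teacher_scores))  # -> ['hate', 'harassment', 'violence']
```

A detector trained only on the single-label view would count "harassment" or "violence" predictions on this prompt as errors, which is exactly the kind of biased evaluation the multi-label benchmarks are meant to fix.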