Semi-Supervised Learning for Large Language Models Safety and Content Moderation

📅 2025-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address bottlenecks in LLM safety classification—including scarcity of high-quality labeled data, substantial bias in synthetic data, and frequent human annotation errors—this paper proposes a semi-supervised training paradigm tailored for safety auditing tasks. Methodologically, it integrates the UDA and FixMatch frameworks while introducing two key innovations: (1) the first systematic application of task-customized text augmentation—specifically, semantics-preserving adversarial perturbations and safety-intent rewriting—to LLM safety classification; and (2) unified modeling of risk identification across both prompt and response stages, enhanced by multi-stage consistency regularization. Evaluated on five mainstream safety benchmark datasets, the method achieves an average 12.7% F1-score improvement over supervised baselines and surpasses them even when trained on only 10% of labeled data. This significantly reduces dependence on both labeling scale and annotation quality.
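The summary above names FixMatch-style consistency regularization as a core building block: weakly augmented unlabeled text produces a confident pseudo-label, and the model's prediction on a strongly augmented version of the same text is trained to match it. As a minimal sketch (not the paper's implementation; the function name, threshold default, and NumPy formulation are illustrative assumptions), the unlabeled-data loss can be written as:

```python
import numpy as np

def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """FixMatch-style consistency loss for a batch of unlabeled examples.

    weak_probs:   (N, C) class probabilities for weakly augmented inputs
    strong_probs: (N, C) class probabilities for strongly augmented inputs

    Only examples whose weak-augmentation confidence exceeds `threshold`
    contribute: their argmax becomes a hard pseudo-label, and we take the
    cross-entropy of the strong-augmentation prediction against it.
    """
    weak_probs = np.asarray(weak_probs, dtype=float)
    strong_probs = np.asarray(strong_probs, dtype=float)

    confidence = weak_probs.max(axis=1)       # max class probability per example
    pseudo_labels = weak_probs.argmax(axis=1) # hard pseudo-label per example
    mask = confidence >= threshold            # keep only confident pseudo-labels
    if not mask.any():
        return 0.0

    # Probability the strong-augmentation view assigns to each pseudo-label.
    picked = strong_probs[mask, pseudo_labels[mask]]
    # Cross-entropy against the hard pseudo-label, averaged over ALL unlabeled
    # examples (unconfident ones contribute zero), as in FixMatch.
    return float(-np.log(np.clip(picked, 1e-12, 1.0)).sum() / len(weak_probs))
```

Low-confidence examples are masked out rather than dropped from the denominator, so the loss naturally shrinks when few pseudo-labels pass the threshold.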

📝 Abstract
Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.
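The abstract stresses that augmentation is the central part of these semi-supervised algorithms, which requires generating weak/strong views of each unlabeled text. A toy illustration of such a pair (these particular transforms — random casing and word dropout — are placeholder examples, not the task-specific augmentations the paper proposes) might look like:

```python
import random

def weak_augment(text, rng):
    """Weak augmentation: a mild, label-preserving tweak (here, random casing).
    The pseudo-label is computed from this view."""
    return "".join(c.upper() if rng.random() < 0.1 else c for c in text)

def strong_augment(text, rng, drop_prob=0.3):
    """Strong augmentation: word dropout, a harsher perturbation under which
    the classifier's prediction should remain consistent."""
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    return " ".join(kept) if kept else words[0]
```

In the paper's setting, general-purpose transforms like these would be replaced by task-specific ones that preserve the safety-relevant intent of the prompt or response.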
Problem

Research questions and friction points this paper is trying to address.

Improves safety classifiers for LLMs with semi-supervised learning
Reduces reliance on large labeled datasets prone to errors
Enhances moderation of both prompts and model responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised learning for safety classifiers
Task-specific augmentations enhance performance
Analyzes both prompts and LLM responses
Eduard Stefan Dinuta
National University of Science and Technology Politehnica Bucharest
Iustin Sirbu
National University of Science and Technology Politehnica Bucharest, Renius Technologies
Traian Rebedea
NVIDIA & Assoc Prof @ University Politehnica of Bucharest
Artificial Intelligence · Natural Language Processing · Machine Learning · Human-Computer Interaction