BingoGuard: LLM Content Moderation Tools with Risk Levels

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based content moderation tools struggle to assess fine-grained risk levels of generated content; in particular, they show low recall on low-severity harmful outputs, which hinders platforms from implementing differentiated safety policies. To address this, the authors propose BingoGuard: (1) per-topic severity rubrics covering 11 distinct harmful categories; (2) a generate-then-filter framework for constructing a high-quality dataset annotated with multiple severity levels; and (3) a training recipe over this data that enables an 8B-parameter model to predict both binary safety labels and severity levels. Evaluated on WildGuardTest, HarmBench, and the newly curated BingoGuardTest benchmark, BingoGuard-8B achieves state-of-the-art performance, outperforming WildGuard by 4.3% and markedly improving detection of low-severity harmful content.

📝 Abstract
Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severities, and styles, and BingoGuardTest, a test set with 988 examples explicitly labeled according to our severity rubrics, which enables fine-grained analysis of model behavior across severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
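The generate-then-filter idea described in the abstract can be sketched as a two-stage pipeline: generate candidate responses at each target severity level, then keep only those that pass a quality filter. The sketch below is a minimal illustration with stand-in stubs; `toy_generate` and `toy_accept` are hypothetical placeholders, not the paper's actual generator or filter models.

```python
from dataclasses import dataclass

# Target severity levels; 0 = safe, higher = more severe (illustrative only).
SEVERITY_LEVELS = [0, 1, 2, 3]

@dataclass
class Candidate:
    prompt: str
    response: str
    severity: int

def generate_candidates(prompt, generate):
    """Stage 1: produce one candidate response per target severity level."""
    return [Candidate(prompt, generate(prompt, s), s) for s in SEVERITY_LEVELS]

def filter_candidates(candidates, accept):
    """Stage 2: keep only candidates that pass the quality filter."""
    return [c for c in candidates if accept(c)]

# Toy stand-ins for demonstration (a real pipeline would call LLMs here):
toy_generate = lambda prompt, sev: f"response to '{prompt}' at level {sev}"
toy_accept = lambda c: c.severity != 1  # pretend level-1 outputs were low quality

dataset = filter_candidates(generate_candidates("example query", toy_generate), toy_accept)
print([c.severity for c in dataset])  # → [0, 2, 3]
```

Running the filter stage separately from generation is what makes the approach scalable: low-quality or mislabeled candidates are discarded rather than manually corrected.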
Problem

Research questions and friction points this paper is trying to address.

Detect and assess varying risk levels of LLM-generated malicious content.
Develop a scalable framework for generating and filtering severity-labeled data.
Enhance content moderation by incorporating severity levels into model training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Per-topic severity rubrics for harmful content
Generate-then-filter framework for scalable annotations
BingoGuard-8B model with enhanced severity detection
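The dual-output interface the paper describes, a binary safety label plus a severity level, can be illustrated with a minimal sketch. The keyword heuristic below is purely a hypothetical stand-in for the trained 8B model; the output schema is the point, not the scoring logic.

```python
def moderate(text: str) -> dict:
    """Return a binary safety label plus a coarse severity level.

    Toy keyword cues stand in for a trained classifier; a real moderator
    like BingoGuard-8B would score severity with a fine-tuned LLM.
    """
    cues = {3: ("weapon", "exploit"), 2: ("harass",), 1: ("insult",)}
    lowered = text.lower()
    for severity in sorted(cues, reverse=True):  # check most severe first
        if any(word in lowered for word in cues[severity]):
            return {"unsafe": True, "severity": severity}
    return {"unsafe": False, "severity": 0}

print(moderate("a mild insult"))        # → {'unsafe': True, 'severity': 1}
print(moderate("benign cooking tips"))  # → {'unsafe': False, 'severity': 0}
```

Exposing the severity level alongside the binary label is what lets platforms with different risk tolerances set their own filtering thresholds instead of relying on a single safe/unsafe cutoff.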