🤖 AI Summary
Existing LLM-based content moderation tools struggle to assess fine-grained risk levels of generated content, and in particular show low recall on low-severity harmful outputs, which hinders platforms from implementing differentiated safety policies. To address this, BingoGuard contributes: (1) per-topic severity rubrics covering 11 distinct harmful topics; (2) a scalable generate-then-filter framework for constructing a high-quality dataset annotated with severity levels; and (3) a training recipe that equips an 8B-parameter model to predict both binary safety labels and fine-grained severity levels. Evaluated on WildGuardTest, HarmBench, and the newly curated BingoGuardTest benchmark, BingoGuard-8B achieves state-of-the-art performance, outperforming WildGuard by 4.3% and markedly improving detection of low-severity harmful content.
📝 Abstract
Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severities, and styles, and BingoGuardTest, a test set with 988 examples explicitly labeled based on our severity rubrics that enables fine-grained analysis of model behavior across severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
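The generate-then-filter pipeline can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the number of severity levels, and the quality threshold are all assumptions; the two stand-in functions would be replaced by a prompted generator LLM and a judge/filter model in the real pipeline.

```python
# Hypothetical sketch of the generate-then-filter data construction
# described in the abstract. All names and values here are illustrative.

SEVERITY_LEVELS = [0, 1, 2, 3]  # 0 = benign; higher = more severe (assumed scale)

def generate_response(prompt: str, severity: int) -> str:
    """Stand-in for an LLM prompted to answer at a target severity level."""
    return f"[severity-{severity}] response to: {prompt}"

def quality_score(response: str, target_severity: int) -> float:
    """Stand-in for the filtering step that checks a candidate actually
    matches its target severity; a real pipeline would use a judge model."""
    return 0.9  # dummy fixed score for illustration

def build_examples(prompt: str, threshold: float = 0.5):
    """Generate one candidate per severity level, keep only those passing
    the quality filter, and attach both labels the moderator must predict:
    a binary safety label and a fine-grained severity level."""
    examples = []
    for level in SEVERITY_LEVELS:
        resp = generate_response(prompt, level)
        if quality_score(resp, level) >= threshold:
            examples.append({
                "prompt": prompt,
                "response": resp,
                "safe": level == 0,   # binary safety label
                "severity": level,    # fine-grained severity label
            })
    return examples

data = build_examples("example query")
```

Training on examples labeled this way is what lets a single model serve both tasks: platforms with strict policies can reject anything unsafe, while more permissive ones can filter only above a chosen severity level.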