The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
The field of safety evaluation for large language models (LLMs) lacks a systematic, unified survey. Method: This paper introduces the first holistic “Why/What/Where/How” four-dimensional analytical framework to rigorously distinguish safety evaluation from general model evaluation. Through bibliometric analysis, cross-methodological comparison, and taxonomy-driven modeling, it systematically synthesizes over 100 studies spanning core safety dimensions—including toxicity, bias, robustness, and truthfulness—and integrates diverse paradigms such as human evaluation, LLM-as-a-judge automation, red-teaming, and adversarial prompt engineering. Contributions: (1) A novel multi-granularity knowledge graph of LLM safety evaluation; (2) A structured resource inventory comprising 30+ benchmarks, 50+ metrics, and 20+ tools; and (3) Identification of key open challenges alongside reusable methodological pathways—providing both theoretical foundations and practical guidance for academic research and industrial deployment.

📝 Abstract
With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP), including areas such as content generation, human-computer interaction, machine translation, and code generation, among others. However, their widespread deployment has also raised significant safety concerns. In recent years, LLM-generated content has occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios, which has garnered extensive attention from both academia and industry. While numerous efforts have been made to evaluate the safety risks associated with LLMs, there remains a lack of systematic reviews summarizing these research endeavors. This survey aims to provide a comprehensive and systematic overview of recent advancements in LLM safety evaluation, focusing on several key aspects: (1) "Why evaluate", which explores the background of LLM safety evaluation, how it differs from general LLM evaluation, and the significance of such evaluation; (2) "What to evaluate", which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, and truthfulness; (3) "Where to evaluate", which summarizes the evaluation metrics, datasets, and benchmarks currently used in safety evaluations; (4) "How to evaluate", which reviews existing evaluation toolkits and categorizes mainstream evaluation methods based on the roles of the evaluators. Finally, we identify the challenges in LLM safety evaluation and propose potential research directions to promote further advancement in this field. We emphasize the importance of prioritizing LLM safety evaluation to ensure the safe deployment of these models in real-world applications.
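The "How to evaluate" dimension above distinguishes evaluation methods by the role of the evaluator, with automated LLM-as-a-judge pipelines and red-teaming among the paradigms the survey covers. A minimal sketch of such an automated safety-evaluation loop is shown below; the function names and the keyword-based judge stub are illustrative assumptions, not an API from the paper, and a real setup would query an actual judge model instead.

```python
# Sketch of an automated safety-evaluation loop in the LLM-as-a-judge style.
# The keyword stub below stands in for a call to a strong judge model.

UNSAFE_MARKERS = ("how to build a bomb", "steal credentials")

def judge_response(response: str) -> bool:
    """Stub judge: flag a response as unsafe if it contains a marker.
    In practice, this would prompt a judge LLM to classify the response."""
    text = response.lower()
    return any(marker in text for marker in UNSAFE_MARKERS)

def attack_success_rate(responses: list[str]) -> float:
    """Fraction of adversarial prompts that elicited an unsafe response --
    a common red-teaming metric."""
    if not responses:
        return 0.0
    unsafe = sum(judge_response(r) for r in responses)
    return unsafe / len(responses)

responses = [
    "I can't help with that request.",
    "Sure, here is how to build a bomb: ...",
    "Let's talk about something safe instead.",
]
print(attack_success_rate(responses))
```

Swapping the stub for a real judge model (or a toxicity classifier) changes only `judge_response`; the aggregation into a metric such as attack success rate stays the same.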
Problem

Research questions and friction points this paper is trying to address.

Systematically review safety evaluation methods for LLMs
Address toxicity, bias, and robustness in LLM outputs
Identify challenges and future directions in LLM safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic review of advancements in LLM safety evaluation
Categorizes safety tasks: toxicity, robustness, ethics, bias
Summarizes metrics, datasets, and toolkits for evaluation
Songyang Liu
School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Chaozhuo Li
Microsoft Research Asia
Jiameng Qiu
School of Cyberspace Security, Jinan University, Guangzhou, China
Xi Zhang
School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Feiran Huang
Professor, Jinan University
Recommender Systems · Text-to-SQL · Sentiment Analysis · LLMs · Multimodal Learning
Litian Zhang
Beihang University
Yiming Hei
China Academy of Information and Communications Technology, Beijing, China
Philip S. Yu
Professor of Computer Science, University of Illinois at Chicago
Data Mining · Databases · Privacy