🤖 AI Summary
The field of safety evaluation for large language models (LLMs) lacks a systematic, unified survey. Method: This paper introduces the first holistic "Why/What/Where/How" four-dimensional analytical framework, which rigorously distinguishes safety evaluation from general model evaluation. Through bibliometric analysis, cross-methodological comparison, and taxonomy-driven modeling, it systematically synthesizes over 100 studies spanning core safety dimensions, including toxicity, bias, robustness, and truthfulness, and integrates diverse evaluation paradigms such as human evaluation, LLM-as-a-judge automation, red-teaming, and adversarial prompt engineering. Contributions: (1) a multi-granularity knowledge graph of LLM safety evaluation; (2) a structured resource inventory comprising 30+ benchmarks, 50+ metrics, and 20+ tools; and (3) identification of key open challenges alongside reusable methodological pathways, providing both theoretical foundations and practical guidance for academic research and industrial deployment.
📝 Abstract
With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in Natural Language Processing (NLP), including content generation, human-computer interaction, machine translation, and code generation. However, their widespread deployment has also raised significant safety concerns. In recent years, LLM-generated content has occasionally exhibited unsafe elements such as toxicity and bias, particularly in adversarial scenarios, which has drawn extensive attention from both academia and industry. While numerous efforts have been made to evaluate the safety risks associated with LLMs, there remains a lack of systematic reviews summarizing this line of research. This survey aims to provide a comprehensive and systematic overview of recent advances in LLM safety evaluation, focusing on four key aspects: (1) "Why evaluate", which explores the background of LLM safety evaluation, how it differs from general LLM evaluation, and the significance of such evaluation; (2) "What to evaluate", which examines and categorizes existing safety evaluation tasks according to key capabilities, covering dimensions such as toxicity, robustness, ethics, bias and fairness, and truthfulness; (3) "Where to evaluate", which summarizes the metrics, datasets, and benchmarks currently used in safety evaluations; and (4) "How to evaluate", which reviews existing evaluation toolkits and categorizes mainstream evaluation methods according to the roles of the evaluators. Finally, we identify the challenges in LLM safety evaluation and propose potential research directions to promote further progress in this field. We emphasize the importance of prioritizing LLM safety evaluation to ensure the safe deployment of these models in real-world applications.