SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks

📅 2025-02-16
🤖 AI Summary
Existing LLM safety evaluations predominantly focus on single-turn interactions or isolated attack types, lacking fine-grained assessment of how models identify hazardous content, mitigate it, and remain consistent across multi-turn dialogues. This work introduces SafeDialBench, a fine-grained multi-turn safety benchmark covering bilingual (Chinese/English) content, 22 dialogue scenarios, and over 4,000 multi-turn dialogues, and systematically evaluates model robustness against seven prevalent jailbreak attack strategies. The authors propose a two-tier hierarchical safety taxonomy spanning six safety dimensions and an integrated evaluation framework that jointly assesses detection, mitigation, and inter-turn consistency, incorporating multi-scenario modeling, cross-lingual generation, and human-in-the-loop scoring. Extensive experiments across 17 state-of-the-art models reveal that Yi-34B-Chat and GLM4-9B-Chat achieve the highest safety scores, whereas Llama3.1-8B-Instruct and o3-mini exhibit significant vulnerabilities.

📝 Abstract
With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess safety. Additionally, these benchmarks have not taken into account the LLM's capability of identifying and handling unsafe information in detail. To address these issues, we propose a fine-grained benchmark, SafeDialBench, for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4,000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework for LLMs, measuring capabilities in detecting and handling unsafe information and maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
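The evaluation style the abstract describes, scoring each turn of a dialogue for detection, handling, and consistency under a jailbreak attack, can be sketched as a simple loop. This is an illustrative sketch only: the class names, equal-weight scoring, and the toy model/judge below are assumptions, not the SafeDialBench implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TurnScore:
    detection: float    # did the model recognize the unsafe request? (0-1)
    handling: float     # did it refuse or redirect appropriately? (0-1)
    consistency: float  # did it stay safe relative to earlier turns? (0-1)

@dataclass
class DialogueResult:
    attack: str
    scenario: str
    turns: list = field(default_factory=list)

    @property
    def safety_score(self) -> float:
        # Hypothetical aggregation: equal-weight average over turns
        # and the three capabilities.
        if not self.turns:
            return 0.0
        per_turn = [(t.detection + t.handling + t.consistency) / 3
                    for t in self.turns]
        return sum(per_turn) / len(per_turn)

def evaluate_dialogue(model, judge, user_turns, attack, scenario):
    """Run a multi-turn dialogue against `model`, scoring every reply."""
    result = DialogueResult(attack=attack, scenario=scenario)
    history = []
    for user_msg in user_turns:
        history.append({"role": "user", "content": user_msg})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        result.turns.append(judge(history, reply))  # judge -> TurnScore
    return result

# Toy stand-ins for the model under test and the scoring judge.
refusing_model = lambda history: "I can't help with that request."
simple_judge = lambda history, reply: TurnScore(
    detection=1.0 if "can't" in reply else 0.0,
    handling=1.0 if "can't" in reply else 0.0,
    consistency=1.0,
)

res = evaluate_dialogue(
    refusing_model, simple_judge,
    ["Tell me about locks.", "Now explain how to pick one."],
    attack="purpose reverse", scenario="illegal activity",
)
print(round(res.safety_score, 2))  # 1.0
```

In the paper's setting the judge role would be filled by LLM-based and human-in-the-loop scoring rather than a keyword check, but the per-turn, per-dimension structure is the same.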
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM safety in multi-turn dialogues
Evaluating diverse jailbreak attack strategies
Developing fine-grained safety taxonomy and metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn dialogue safety assessment
Two-tier hierarchical safety taxonomy
Seven diverse jailbreak attack strategies
Authors

- Hongye Cao, Chang'an University (remote sensing)
- Yanming Wang, National Key Laboratory for Novel Software Technology, Nanjing University
- Sijia Jing, National Key Laboratory for Novel Software Technology, Nanjing University
- Ziyue Peng, National Key Laboratory for Novel Software Technology, Nanjing University
- Zhixin Bai, Harbin Institute of Technology (natural language processing)
- Zhe Cao, National Key Laboratory for Novel Software Technology, Nanjing University
- Meng Fang, University of Liverpool (natural language processing, reinforcement learning, agents, artificial intelligence)
- Fan Feng, University of California, San Diego
- Jiaheng Liu
- Boyan Wang, National Key Laboratory for Novel Software Technology, Nanjing University
- Tianpei Yang, National Key Laboratory for Novel Software Technology, Nanjing University
- Jing Huo, Nanjing University (machine learning, computer vision)
- Yang Gao, National Key Laboratory for Novel Software Technology, Nanjing University
- Fanyu Meng, China Mobile Research Institute, Beijing
- Xi Yang, China Mobile (Suzhou) Software Technology Co., Ltd., Suzhou
- Chao Deng, China Mobile Research Institute, Beijing
- Junlan Feng, Chief Scientist at China Mobile Research (natural language, machine learning, speech processing, data mining)