SafeLawBench: Towards Safe Alignment of Large Language Models

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM safety evaluation lacks objective, reproducible benchmarks, and existing methods suffer from subjectivity and ill-defined criteria. To address this, we introduce SafeLawBench, the first benchmark to ground LLM safety assessment in legal standards. It comprises 24,860 multiple-choice questions and 1,106 open-domain QA tasks, organized under a three-tier legal risk taxonomy that supports multi-granular, verifiable evaluation of safety alignment. We systematically evaluate 20 models (2 closed-source, including GPT-4o and Claude-3.5-Sonnet, and 18 open-source) using zero-shot and few-shot prompting, refusal-behavior analysis, reasoning-stability testing, and majority-voting ensembling. Even the strongest models do not exceed 80.5% accuracy on the multiple-choice tasks, and the average across all 20 models is just 68.8%, exposing a substantial gap in legal compliance. SafeLawBench thus provides a reproducible, law-grounded evaluation paradigm for advancing LLM safety alignment research.
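A minimal sketch of how accuracy on such multiple-choice items is typically scored; the answer-parsing regex and the `score_multiple_choice` name below are illustrative assumptions, not the authors' released evaluation code:

```python
import re

def score_multiple_choice(model_outputs, gold_answers):
    """Accuracy over multiple-choice items.

    model_outputs: raw model completions, one per question.
    gold_answers:  gold option letters, e.g. "A", "B", "C", "D".
    """
    correct = 0
    for raw, gold in zip(model_outputs, gold_answers):
        # Take the first standalone option letter in the completion
        # as the model's predicted answer (a hypothetical parsing rule).
        match = re.search(r"\b([A-D])\b", raw)
        predicted = match.group(1) if match else None
        correct += int(predicted == gold)
    return correct / len(gold_answers)

# Example: a completion of "The answer is B." on a gold-B item is correct.
print(score_multiple_choice(["The answer is B."], ["B"]))  # 1.0
```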

📝 Abstract
With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs' safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multiple-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation covered 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs' safety-related reasoning stability and refusal behavior. Additionally, we found that a majority-voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy on the multiple-choice tasks of SafeLawBench, while the average accuracy of all 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Lack of definitive safety standards for LLM evaluation
Subjective nature of current safety benchmarks
Need for a legal perspective in safety assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes SafeLawBench, a benchmark grounding LLM safety evaluation in legal standards
Includes 24,860 multiple-choice questions and 1,106 open-domain QA tasks
Shows that a majority-voting mechanism improves model accuracy (see the sketch below)
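A plausible sketch of the majority-voting mechanism, assuming each question is sampled k times and the most frequent option letter is kept; the `majority_vote` name and the tie-breaking rule are assumptions for illustration, not the paper's published implementation:

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Aggregate k sampled answers to one question by majority vote.

    sampled_answers: option letters parsed from k independent samples
    of the same model (e.g. at nonzero temperature). Ties go to the
    answer encountered first, per Counter.most_common ordering.
    """
    return Counter(sampled_answers).most_common(1)[0][0]

# Example: three samples of the same question; "C" wins 2-to-1.
print(majority_vote(["C", "A", "C"]))  # C
```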
👥 Authors
Chuxue Cao (Hong Kong University of Science and Technology)
Han Zhu (Hong Kong University of Science and Technology)
Jiaming Ji (Peking University)
Qichao Sun (Hong Kong University of Science and Technology)
Zhenghao Zhu (Hong Kong University of Science and Technology)
Yinyu Wu (Hong Kong University of Science and Technology)
Juntao Dai (Peking University)
Yaodong Yang (Peking University)
Sirui Han (Hong Kong University of Science and Technology)
Yike Guo (Hong Kong University of Science and Technology)