SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

📅 2024-06-20
🏛️ arXiv.org
📈 Citations: 36
✨ Influential: 6
๐Ÿ“„ PDF
🤖 AI Summary
Existing LLM safety-refusal benchmarks suffer from coarse-grained topic categorization, narrow coverage of languages and formats, and high evaluation costs. Method: the authors propose a fine-grained, linguistically robust, and computationally efficient benchmark: (1) a fine-grained taxonomy of 44 risk categories, paired with a class-balanced unsafe-instruction set; (2) explicit multilingual and dialectal augmentations to probe linguistic robustness; (3) a lightweight supervised fine-tuned 7B model that serves as an automatic safety judge in place of GPT-4. Contribution/Results: meta-evaluated on 7K+ human-annotated samples and applied to 50+ open- and closed-source LLMs, the judge achieves >92% accuracy, on par with GPT-4, while reducing inference cost by orders of magnitude. This enables scalable, reproducible, and fine-grained analysis of LLM safety refusal behavior.
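The evaluation pipeline summarized above can be sketched as a simple loop: prompt each model with class-balanced unsafe instructions, have a lightweight judge model label whether each response fulfills or refuses the request, then aggregate fulfillment rates per risk category. This is a minimal illustrative sketch, not the paper's actual API; all function names, category names, and the stub model/judge are hypothetical.

```python
from collections import defaultdict

def evaluate_refusals(model, judge, instructions):
    """Hypothetical SORRY-Bench-style evaluation loop.

    `instructions` is a list of (category, prompt) pairs drawn from a
    class-balanced unsafe-instruction set; `judge` is a small classifier
    that returns True when the response FULFILLS the unsafe request
    (i.e., the model failed to refuse).
    """
    fulfilled = defaultdict(int)
    total = defaultdict(int)
    for category, prompt in instructions:
        response = model(prompt)
        total[category] += 1
        if judge(prompt, response):
            fulfilled[category] += 1
    # Per-category fulfillment rate: lower means more consistent refusal.
    return {c: fulfilled[c] / total[c] for c in total}

# Toy usage with stand-in model and judge:
always_refuse = lambda p: "I can't help with that."
naive_judge = lambda p, r: "can't" not in r  # fulfilled iff no refusal phrase
data = [("fraud", "toy prompt 1"), ("self-harm", "toy prompt 2")]
rates = evaluate_refusals(always_refuse, naive_judge, data)
```

Per-category rates (rather than a single aggregate score) are what make the imbalance issue visible: a model can look safe on average while fulfilling most requests in an under-represented category.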

📝 Abstract
Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics, and are over-representing some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 44 potentially unsafe topics, and 440 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) for evaluation, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4 scale LLMs, with lower computational cost. Putting these together, we evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench, analyzing their distinctive safety refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner. Benchmark demo, data, code, and models are available through https://sorry-bench.github.io.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to recognize and reject unsafe user requests.
Addressing limitations in existing safety evaluation benchmarks.
Developing a cost-effective, accurate automated safety evaluator for LLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained taxonomy for unsafe topics
Diverse linguistic augmentations in prompts
Efficient fine-tuned 7B LLM evaluator
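The linguistic-augmentation idea in the second bullet can be sketched as rewriting each base instruction through a set of transformations (different languages, dialects, or formats) and comparing refusal rates across variants. The sketch below is a hypothetical illustration under assumed stub interfaces, not the paper's implementation.

```python
def augment_and_compare(base_prompts, augmentations, model, judge):
    """Sketch of measuring refusal robustness under linguistic variation.

    `augmentations` maps a name (e.g. "base", "translation", "slang")
    to a prompt-rewriting function. Refusal rates are computed per
    augmentation, so drops relative to "base" expose blind spots.
    `judge` returns True when the response fulfills the request.
    """
    rates = {}
    for name, transform in augmentations.items():
        refused = 0
        for prompt in base_prompts:
            response = model(transform(prompt))
            if not judge(prompt, response):
                refused += 1
        rates[name] = refused / len(base_prompts)
    return rates

# Toy usage: a stub model that only refuses lowercase prompts,
# so the "uppercase" augmentation exposes a robustness gap.
augs = {"base": lambda p: p, "uppercase": str.upper}
stub_model = lambda p: "I must refuse." if p.islower() else "Sure, here is how."
stub_judge = lambda p, r: r.startswith("Sure")
gap = augment_and_compare(["toy unsafe prompt"], augs, stub_model, stub_judge)
```

Note that the judge always receives the original prompt: the question is whether the model still refuses the same underlying request once its surface form changes.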