OR-Bench: An Over-Refusal Benchmark for Large Language Models

📅 2024-05-31

🏛️ arXiv.org

📈 Citations: 34

✨ Influential: 7

career value

170K/year

🤖 AI Summary

Safety alignment of large language models (LLMs) often induces over-refusal—i.e., erroneous rejection of harmless prompts—thereby compromising practical utility. Existing studies are hindered by the difficulty of constructing high-quality, over-refusal–inducing prompts and lack standardized, comprehensive evaluation benchmarks. Method: We introduce OR-Bench, the first large-scale, standardized benchmark for over-refusal evaluation, comprising 80K harmless prompts across 10 diverse scenarios, 1K challenging cases, and 600 strongly toxic prompts. We propose a novel automated data synthesis framework integrating controllable prompt engineering and rule-based augmentation, coupled with multi-dimensional refusal pattern modeling and rigorous human validation. Contribution/Results: We conduct systematic evaluations across 32 mainstream models spanning eight model families. All data and code are publicly released to advance safety alignment from a “toxicity-avoidance” paradigm toward a “utility-preserving” paradigm.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that can elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/bench-llms and our codebase is open-sourced at https://github.com/justincui03/or-bench. We hope this benchmark can help the community develop better safety aligned models.

Problem

Research questions and friction points this paper is trying to address.

Measure systematic over-refusal in LLMs safety alignment

Automate generation of large-scale over-refusal datasets

Assess 32 LLMs across 10 rejection categories

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated generation of large-scale over-refusal datasets

First large-scale benchmark for measuring over-refusal

Comprehensive evaluation of 32 popular LLMs

🔎 Similar Papers

No similar papers found.