Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

📅 2024-09-01

🏛️ arXiv.org

📈 Citations: 15

✨ Influential: 3

career value

181K/year

🤖 AI Summary

Safety-aligned large language models (LLMs) often erroneously refuse harmless prompts (e.g., “How to kill a mosquito”), degrading usability and eroding user trust. To address this false refusal problem, we propose the first model-adaptive, content-controllable framework for generating pseudo-harmful prompts, integrating semantic perturbation, harm-aware control, and model-in-the-loop feedback. We further introduce PHTest—the first large-scale (10× larger than existing benchmarks), fine-grained evaluation benchmark for false refusal—featuring explicit separation of ambiguous/controversial prompts. Empirical evaluation across 20 mainstream LLMs reveals a significant trade-off between false refusal rate and jailbreak resistance; notably, most jailbreak defense methods substantially exacerbate false refusals. We publicly release both our code and the PHTest dataset to advance co-optimization of safety and usability.

Technology Category

Application Category

📝 Abstract

Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful prompts, like"how to kill a mosquito,"which are actually harmless. Frequent false refusals not only frustrate users but also provoke a public backlash against the very values alignment seeks to protect. In this paper, we propose the first method to auto-generate diverse, content-controlled, and model-dependent pseudo-harmful prompts. Using this method, we construct an evaluation dataset called PHTest, which is ten times larger than existing datasets, covers more false refusal patterns, and separately labels controversial prompts. We evaluate 20 LLMs on PHTest, uncovering new insights due to its scale and labeling. Our findings reveal a trade-off between minimizing false refusals and improving safety against jailbreak attacks. Moreover, we show that many jailbreak defenses significantly increase the false refusal rates, thereby undermining usability. Our method and dataset can help developers evaluate and fine-tune safer and more usable LLMs. Our code and dataset are available at https://github.com/umd-huang-lab/FalseRefusal

Problem

Research questions and friction points this paper is trying to address.

Detect false refusals in safety-aligned LLMs

Generate diverse pseudo-harmful prompts automatically

Balance safety and usability in LLM responses

Innovation

Methods, ideas, or system contributions that make the work stand out.

Auto-generate diverse pseudo-harmful prompts

Construct large-scale PHTest dataset

Evaluate trade-off safety usability LLMs

🔎 Similar Papers

Don't Say No: Jailbreaking LLM by Suppressing Refusal