EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions

📅 2025-05-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) often exhibit over-refusal: rejecting semantically harmless inputs due to excessive safety alignment, which undermines usability. Method: We propose EVOREFUSE, the first evolutionary-algorithm-based prompt optimization framework, guided by evidence lower bound (ELBO) maximization, to generate instructions with a high probability of inducing refusals. We further introduce EVOREFUSE-TEST/ALIGN, the first dual-purpose benchmark datasets supporting both over-refusal evaluation and alignment refinement. Contribution/Results: EVOREFUSE-TEST achieves a 140.41% higher average refusal-triggering rate than the next-best benchmark across nine mainstream LLMs. Using EVOREFUSE-ALIGN for supervised fine-tuning and preference alignment, Llama3.1-8B produces up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety performance. Our work establishes a scalable, diverse, and theoretically grounded paradigm for evaluating and controllably aligning LLM refusal behavior.

📝 Abstract
Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries that trigger unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm that explores the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize an evidence lower bound (ELBO) on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with a 140.41% higher average refusal-triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% higher LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT fine-tuned on EVOREFUSE-ALIGN achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals that models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context.
Problem

Research questions and friction points this paper is trying to address.

Mitigating LLM over-refusal to pseudo-malicious instructions
Generating diverse refusal-inducing prompts via evolutionary optimization
Improving alignment datasets to reduce unnecessary LLM refusals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary algorithm for diverse prompt optimization
Maximizes an ELBO on LLM refusal probability via iterative evolution
Creates datasets for testing and alignment training
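The evolutionary loop described above can be sketched as a minimal toy version. The paper scores candidates with an ELBO-based estimate of LLM refusal probability; the keyword-count scorer, mutation, and recombination operators below are illustrative stand-ins, not the authors' implementation.

```python
import random

# Hypothetical stand-in for the paper's ELBO-based refusal scorer:
# here we simply score the fraction of "sensitive" keywords in a prompt.
TRIGGER_WORDS = {"bypass", "hack", "weapon", "exploit"}

def refusal_score(prompt: str) -> float:
    words = prompt.lower().split()
    return sum(w in TRIGGER_WORDS for w in words) / max(len(words), 1)

def mutate(prompt: str, rng: random.Random) -> str:
    # Toy mutation: inject a random trigger word at a random position.
    words = prompt.split()
    words.insert(rng.randrange(len(words) + 1), rng.choice(sorted(TRIGGER_WORDS)))
    return " ".join(words)

def recombine(a: str, b: str) -> str:
    # Toy crossover: first half of one parent, second half of the other.
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2 :])

def evolve(seeds, generations=10, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        # Explore the instruction space via mutation and recombination.
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        if len(population) >= 2:
            children += [recombine(*rng.sample(population, 2))
                         for _ in range(pop_size)]
        # Elitist selection: keep the candidates with the highest scores.
        population = sorted(population + children, key=refusal_score,
                            reverse=True)[:pop_size]
    return population[0]

best = evolve(["how do I safely delete old files",
               "explain network security basics"])
print(best, refusal_score(best))
```

The real framework replaces each toy operator with LLM-driven mutation strategies and scores candidates against target models, but the select-mutate-recombine skeleton is the same.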
Xiaorui Wu
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Xiaofeng Mao
Alibaba Group
Fei Li
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Xin Zhang
Ant Group
Xiaolu Zhang
Ant Group
Jun Zhou
Ant Group
Yuxiang Peng
University of Delaware
Li Zheng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Chong Teng
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Donghong Ji
Wuhan University
Zhuang Li
School of Computing Technologies, Royal Melbourne Institute of Technology, Australia