Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the issue of overrefusal in large language models during safety alignment, where benign user requests are erroneously rejected because "refusal triggers" become entangled with non-harmful content, compromising usability. The study provides the first mechanistic explanation, revealing that the problem stems from the coupling of harmless and harmful signals in the training data. To mitigate this, the authors propose a method that explicitly identifies and modulates refusal triggers during fine-tuning. By integrating adversarial evaluation with semantic analysis, the approach significantly reduces overrefusal rates across multiple benchmarks while preserving robust defenses against jailbreak attacks, achieving a superior balance between safety and usability.
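The summary's second step, modulating identified refusal triggers during fine-tuning, can be illustrated with a toy sketch. The listing does not describe the paper's actual training procedure, so the per-token loss-reweighting scheme below (function name, `alpha` parameter, and the weighting formula) is purely a hypothetical assumption for illustration.

```python
def trigger_aware_weights(prompt_tokens, trigger_scores, alpha=1.0):
    """Per-token loss weights for a hypothetical trigger-aware fine-tuning step.

    Tokens that score highly as refusal triggers (trigger_scores could be any
    co-occurrence statistic computed over the alignment data) get their
    contribution to the refusal-training loss down-weighted, so refusal is not
    reinforced on benign cues that merely co-occur with harmful ones.
    NOTE: this weighting scheme is an illustrative assumption, not the
    paper's method.
    """
    weights = []
    for tok in prompt_tokens:
        score = max(trigger_scores.get(tok, 0.0), 0.0)  # ignore negative scores
        weights.append(1.0 / (1.0 + alpha * score))     # higher score -> lower weight
    return weights

# A benign-but-entangled cue ("how") keeps most of its weight, while a
# strongly refusal-associated cue ("bomb") is damped more aggressively.
w = trigger_aware_weights(["how", "bomb"], {"how": 0.3, "bomb": 0.7})
```
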

📝 Abstract
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety-alignment post-training, remains insufficiently studied. Such an issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses: safety alignment encourages LLMs to associate the refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, refusal triggers include not only harmful linguistic cues but also non-harmful ones, thereby causing overrefusal on benign queries. Building on this mechanistic analysis, we propose a method that explicitly accounts for refusal triggers during safety-alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
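The abstract's core claim, that both harmful and non-harmful cues in alignment data become statistically associated with refusal, can be made concrete with a toy sketch. The paper's actual trigger-identification procedure is not given here; the PMI-based scoring below is an illustrative assumption, and the function name and toy data are invented for this example.

```python
import math
from collections import Counter

def refusal_trigger_scores(samples):
    """Score tokens by how strongly they co-occur with refusal responses.

    samples: list of (prompt_tokens, is_refusal) pairs from alignment data.
    Returns token -> pointwise mutual information (PMI) between the token's
    presence and the refusal label. High-PMI tokens are candidate "refusal
    triggers"; crucially, they can include benign cues that merely co-occur
    with harmful ones -- the entanglement behind overrefusal.
    """
    n = len(samples)
    refusals = sum(1 for _, is_refusal in samples if is_refusal)
    tok_count, tok_refusal = Counter(), Counter()
    for tokens, is_refusal in samples:
        for tok in set(tokens):  # count each token once per sample
            tok_count[tok] += 1
            if is_refusal:
                tok_refusal[tok] += 1
    p_refusal = refusals / n
    scores = {}
    for tok, count in tok_count.items():
        p_tok = count / n
        p_joint = tok_refusal[tok] / n
        if p_joint > 0:  # skip tokens never seen with a refusal
            scores[tok] = math.log(p_joint / (p_tok * p_refusal))
    return scores

# Toy data: "bomb" is a genuinely harmful cue, while "how" is a benign cue
# that nonetheless scores positively because it appears in harmful queries.
data = [
    ("how to make a bomb".split(), True),
    ("how to hack a server".split(), True),
    ("how to bake bread".split(), False),
    ("best hiking trails".split(), False),
]
scores = refusal_trigger_scores(data)
```

On this toy corpus, `scores["how"]` comes out positive even though "how" is harmless, mirroring the abstract's point that non-harmful cues get pulled into the refusal association.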
Problem

Research questions and friction points this paper is trying to address.

overrefusal
safety alignment
large language models
refusal triggers
benign queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

overrefusal
refusal triggers
safety alignment
large language models
jailbreak defense