Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

📅 2025-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the "false refusal" problem in large language models (LLMs), wherein harmless queries—e.g., "How to terminate a Python process"—are erroneously classified as harmful due to overzealous safety alignment. We propose "Think-Before-Refusal" (TBR), a novel paradigm that explicitly integrates safety reflection into instruction tuning. TBR employs structured prompting to guide models to autonomously assess query harmfulness *before* generating a response, thereby enabling precise discrimination between malicious and benign instructions. Experiments across 15 pre-trained models demonstrate that TBR significantly reduces false refusal rates while preserving original safety compliance and language modeling performance. To our knowledge, this is the first general-purpose solution that systematically embeds safety reflection into the instruction-tuning pipeline, achieving a principled balance between safety and usability.

📝 Abstract
Recent advancements in large language models (LLMs) have demonstrated that fine-tuning and human alignment can render LLMs harmless. In practice, such "harmlessness" behavior is mainly achieved by training models to reject harmful requests, such as "Explain how to burn down my neighbor's house", where the model appropriately declines to respond. However, this approach can inadvertently result in false refusal, where models reject benign queries as well, such as "Tell me how to kill a Python process". In this work, we demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. Building on this finding, we introduce the Think-Before-Refusal (TBR) schema and conduct safety-aware instruction fine-tuning incorporating safety reflection. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior while maintaining safety and overall performance compared to those fine-tuned without safety reflection.
Problem

Research questions and friction points this paper is trying to address.

Mitigating false refusal in LLMs for benign queries
Introducing safety reflection to reduce incorrect rejections
Maintaining model safety while minimizing false refusal behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompting safety reflection before response
Introducing Think-Before-Refusal schema
Safety-aware instruction fine-tuning
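The core idea listed above—prompting the model to reflect on a query's harmfulness before deciding whether to refuse—can be sketched as a simple prompt wrapper. The template wording and the helper name `build_tbr_prompt` below are illustrative assumptions, not the paper's exact TBR schema:

```python
# Hypothetical sketch of a Think-Before-Refusal (TBR) style prompt wrapper.
# The reflection template here is an assumption for illustration; the paper's
# actual schema is produced via safety-aware instruction fine-tuning.

TBR_TEMPLATE = (
    "Before answering, briefly reflect on whether the request below is harmful.\n"
    "1. Restate what the user is actually asking for.\n"
    "2. Decide: would fulfilling it plausibly cause real-world harm?\n"
    "3. If it is benign (e.g., ordinary programming tasks), answer helpfully;\n"
    "   refuse only if it is genuinely harmful.\n\n"
    "Request: {query}"
)

def build_tbr_prompt(query: str) -> str:
    """Wrap a user query with a safety-reflection preamble."""
    return TBR_TEMPLATE.format(query=query)

# A benign query that superficially resembles a harmful one:
print(build_tbr_prompt("Tell me how to kill a Python process"))
```

In the paper's setting, such reflection is baked into the model through fine-tuning rather than applied at inference time, but a wrapper like this illustrates the intended decision flow: assess first, then answer or refuse.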
🔎 Similar Papers
No similar papers found.