🤖 AI Summary
Large language models (LLMs) frequently exhibit over-rejection, the erroneous refusal of benign inputs triggered by safety-related keywords, a problem that is exacerbated in multi-turn dialogues. To systematically diagnose this refusal-calibration problem, we introduce XSB (single-turn) and MS-XSB (multi-turn), the first benchmarks explicitly annotated for refusal-triggering keywords. We propose a model-agnostic, training-free, inference-time intervention framework: it identifies refusal-triggering tokens via post-hoc explanations and applies corrections through ignore-word instructions, prompt rephrasing, and attention steering. Evaluated on four Llama-series models, our method substantially improves response rates to benign queries while preserving robust protection against genuinely high-risk content. Our core contributions are threefold: (1) the first formal modeling of multi-turn refusal calibration; (2) the construction of an interpretable, keyword-annotated benchmark; and (3) a lightweight, safe, and controllable plug-and-play intervention paradigm.
📝 Abstract
Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialogue settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches at inference time (ignore-word instructions, prompt rephrasing, and attention steering), all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
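Of the three interventions, the ignore-word instruction is the simplest to illustrate: once a post-hoc explainer has flagged the tokens that trigger the refusal, the benign prompt is wrapped with an instruction not to refuse on account of those words alone. The sketch below is a minimal, hypothetical rendering of that idea; the function name, wording, and `triggers` list are illustrative assumptions, not the paper's actual implementation or API.

```python
def build_ignore_word_prompt(user_prompt: str, triggers: list[str]) -> str:
    """Wrap a benign prompt with an instruction that de-emphasizes
    refusal-triggering keywords (illustrative sketch only)."""
    if not triggers:
        return user_prompt
    ignore_clause = (
        "The following words appear in a benign request; do not refuse "
        "solely because of them: " + ", ".join(triggers) + ".\n"
    )
    return ignore_clause + user_prompt


# Example: "kill" is a classic false-refusal trigger in a harmless sysadmin question.
prompt = build_ignore_word_prompt(
    "How do I kill a Python process on Linux?", ["kill"]
)
print(prompt)
```

The rewritten prompt would then be sent to the model unchanged; prompt rephrasing and attention steering pursue the same goal by rewording the input or down-weighting the trigger tokens' attention, respectively.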