Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the false refusal problem in large language models (LLMs): benign inputs that superficially resemble harmful ones are erroneously treated as unsafe, leading to unwarranted refusals. The authors propose a training-free, model-agnostic mitigation based on ablating a single vector. The core contribution is the identification and localization of an interpretable “false refusal vector” via activation-space analysis, enabling fine-grained safety calibration. The method linearly ablates this vector in a lightweight, plug-and-play manner, preserving both the model’s original capabilities and its genuine safety behaviour. Evaluation across multiple open- and closed-source LLMs shows an average 42% reduction in false refusal rate, while maintaining a >98% refusal rate on truly harmful queries and no degradation on general-purpose benchmarks. The result is an efficient, transparent, and broadly generalizable approach to safety calibration.

📝 Abstract
Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: models should refuse to follow malicious instructions or give harmful advice (e.g. "how do I kill someone?"), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. "how do I kill a Python process?"). Avoiding such false refusal, as prior work has shown, is challenging even for highly capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces the false refusal rate while preserving the model's safety and general capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
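The abstract's core idea, extracting a direction in activation space and linearly ablating it, can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes the vector is found via a difference of mean residual-stream activations between prompts that trigger false refusals and harmless prompts, and that ablation means projecting that direction out of an activation vector.

```python
import numpy as np

def extract_refusal_vector(acts_refused: np.ndarray, acts_harmless: np.ndarray) -> np.ndarray:
    """Candidate false-refusal direction via difference of means.

    acts_refused, acts_harmless: (n_samples, d_model) activations at a
    chosen layer (assumed setup; the paper's extraction may differ).
    Returns a unit vector of shape (d_model,).
    """
    v = acts_refused.mean(axis=0) - acts_harmless.mean(axis=0)
    return v / np.linalg.norm(v)

def ablate(x: np.ndarray, v_hat: np.ndarray) -> np.ndarray:
    """Linear ablation: remove the component of x along v_hat.

    x' = x - (x . v_hat) v_hat, so x' is orthogonal to v_hat.
    """
    return x - (x @ v_hat) * v_hat
```

In practice, the ablation would be applied to the model's hidden states at inference time (e.g. via a forward hook), which is what makes the method training-free and plug-and-play.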
Problem

Research questions and friction points this paper is trying to address.

Mitigating false refusal in language models
Calibrating refusal behaviors for safety
Training-free, model-agnostic safety calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single vector ablation reduces false refusal.
Training-free method preserves model safety.
Model-agnostic approach for fine-grained calibration.
Xinpeng Wang
LMU Munich, Munich Center for Machine Learning
Chengzhi Hu
LMU Munich
Paul Röttger
Bocconi University
Barbara Plank
Professor, LMU Munich, Visiting Prof ITU Copenhagen
Natural Language Processing, Computational Linguistics, Machine Learning, Transfer Learning