Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 8
Influential: 2
🤖 AI Summary
Fine-grained, per-category control of a language model's refusal behavior is hard to achieve with standard alignment: hitting a desired refusal rate typically means retraining with a different mix of refusal data, and different users want different rates. Method: During training, a refusal token — either a single token or one token per refusal category — is prepended to the model's refusal responses. At inference, the probability of emitting these tokens is raised or lowered to steer the refusal rate for each category of sensitive query (e.g., ill-posed questions, instructions for illegal acts, or questions beyond the model's knowledge). Results: A single trained model's refusal rates can be adjusted at generation time to match user preferences, with no further fine-tuning and no model duplication. The core contribution is decoupling refusal-rate control from retraining, yielding a lightweight paradigm for calibrated, controllable refusals.

📝 Abstract
A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries which require information past the model's knowledge horizon. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity for refusing queries of various categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user's desired preference over refusal rates. To address these challenges, we propose refusal tokens — one such token for each refusal category, or a single refusal token — which are prepended to the model's responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category during inference to steer the model's refusal behavior. Refusal tokens enable controlling a single model's refusal rates without the need for any further fine-tuning, simply by selectively intervening during generation.
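The inference-time intervention the abstract describes can be sketched in a few lines: before sampling the first token of a response, add a bias to the logit of the (category-specific) refusal token, which raises or lowers the chance that the model opens with a refusal. This is a minimal illustrative sketch, not the paper's implementation; the toy vocabulary, the refusal-token id, and the bias values are all assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def steer_refusal(logits, refusal_ids, alpha):
    """Add bias `alpha` to the refusal-token logits at the first
    decoding position, then return the resulting distribution.
    Positive alpha makes a refusal more likely; negative, less."""
    out = logits.astype(float).copy()
    out[refusal_ids] += alpha
    return softmax(out)

# Toy vocabulary of 10 tokens; id 7 stands in for a hypothetical
# [REFUSE] token. These logits are made up for illustration.
logits = np.array([1.0, 0.2, -0.5, 0.8, 0.1, 0.0, 0.3, -0.3, -1.0, 0.4])

base = steer_refusal(logits, [7], 0.0)   # unmodified distribution
up   = steer_refusal(logits, [7], 2.0)   # push toward refusing
down = steer_refusal(logits, [7], -2.0)  # push away from refusing
```

Because the bias is applied only at generation time, sweeping `alpha` per category is what lets one model serve many different refusal-rate preferences without retraining.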
Problem

Research questions and friction points this paper is trying to address.

Calibrating refusal behavior in large language models
Controlling refusal rates without fine-tuning multiple models
Enabling adjustable sensitivity to different query categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refusal tokens control refusal rates
Intervene during generation without fine-tuning
Single token per category steers behavior
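The training-side half of the idea — prepending a category-specific refusal token to refusal responses — amounts to a small data-formatting step. A minimal sketch, where the token strings and category names are hypothetical placeholders, not the paper's actual vocabulary:

```python
# Hypothetical category-specific refusal tokens; the paper also
# allows a single shared refusal token for all categories.
REFUSAL_TOKENS = {
    "illegal": "[REFUSE_ILLEGAL]",
    "ill_posed": "[REFUSE_ILL_POSED]",
    "knowledge_horizon": "[REFUSE_UNKNOWN]",
}

def format_response(response, refusal_category=None):
    """Prepend the matching refusal token to a refusal response;
    non-refusal responses are left unchanged."""
    if refusal_category is None:
        return response
    return REFUSAL_TOKENS[refusal_category] + " " + response
```

Training on data formatted this way is what gives the refusal token a well-defined probability at the first decoding position, which the inference-time intervention can then adjust.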