🤖 AI Summary
Existing LLM safety fine-tuning methods rely on refusal-based strategies, which induce abrupt shifts in the output distribution, degrade model capabilities, and leave models vulnerable to sampling attacks (e.g., forcing an initial affirmative response).
Method: We propose a generative harmfulness-detection paradigm: via vocabulary expansion, we introduce a learnable “red-flag token” that the model is fine-tuned to emit *before or during* the generation of harmful content, enabling fine-grained, end-to-end harmfulness detection and interception. This turns the LLM into a real-time generative classifier, decoupling the safety judgment from text generation.
Contribution/Results: Our approach significantly outperforms refusal-based baselines across multiple safety benchmarks while preserving ≥98% of task performance. It is markedly more robust to sampling attacks, long-context perturbations, and supervised fine-tuning (SFT) attacks, and it makes safety evaluations more interpretable.
📝 Abstract
Most fine-tuning-based safety training methods for large language models (LLMs) rely on dramatically changing the model's output distribution when it faces a harmful request, shifting it from an unsafe answer to a refusal. These methods inherently compromise model capabilities and can leave auto-regressive models vulnerable to attacks that raise the likelihood of an initial affirmative token. To avoid this, we propose expanding the model's vocabulary with a special token we call the red flag token, and fine-tuning the model to generate this token whenever harmful content is generated or about to be generated. This novel safety training method effectively augments the LLM into a generative classifier of harmfulness that is active at all times during the conversation. The approach offers several advantages: it lets the model explicitly learn the concept of harmfulness while only marginally affecting the generated distribution, thus preserving the model's utility; it evaluates each generated answer rather than just the input prompt, providing a stronger defence against sampling-based attacks; and it simplifies the evaluation of the model's robustness and reduces correlated failures when combined with an external classifier. We further show increased robustness to long contexts and to supervised fine-tuning attacks.
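The interception mechanism described above can be sketched as a token-by-token generation loop that watches for the red-flag token. This is a minimal, illustrative sketch: the token id, the toy step function, and the function names are assumptions for demonstration, not the paper's actual implementation (which fine-tunes a real LLM after vocabulary expansion).

```python
# Hypothetical id assigned to the red-flag token after expanding the
# vocabulary by one entry (assumption: a GPT-2-sized vocab of 50257 ids).
RED_FLAG_ID = 50257

def generate_with_interception(step_fn, max_tokens=64):
    """Run a token-by-token generation loop; stop and flag the response
    as soon as the model emits the red-flag token, so harmful content
    can be intercepted mid-generation rather than only screened upfront."""
    tokens = []
    for _ in range(max_tokens):
        next_id = step_fn(tokens)
        if next_id == RED_FLAG_ID:
            return tokens, True   # flagged: intercept before more harm
        tokens.append(next_id)
    return tokens, False          # completed without raising the flag

# Toy stand-in for a model's next-token step: emits the red flag
# after three ordinary tokens, mimicking harm detected mid-answer.
def toy_step(prefix):
    return RED_FLAG_ID if len(prefix) == 3 else len(prefix)

output, flagged = generate_with_interception(toy_step)
```

Because the flag is just another vocabulary token, the check costs nothing beyond ordinary decoding, and the partial answer generated before the flag remains available for logging or interpretability analysis.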