🤖 AI Summary
This work investigates whether large language models (LLMs) possess intrinsic semantic understanding of *harmfulness*—beyond merely learning surface-level refusal behaviors—when rejecting harmful instructions. Using causal steering in the latent space, we disentangle a *harmfulness direction* distinct from the refusal direction, revealing that the model’s internal representation of harm is more stable than its refusal behavior. Building on this insight, we propose Latent Guard: an implicit safety mechanism that detects harmfulness directly via the identified latent direction, eliminating the need for an explicit classification head. Experiments show that Latent Guard matches or surpasses Llama Guard 3 8B in detection accuracy across diverse jailbreak attacks, significantly reduces over-refusal, and exhibits strong robustness to adversarial finetuning attacks. This work establishes an interpretable, generalizable, representation-level perspective on AI safety mechanisms.
📝 Abstract
LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond merely refusing? Prior work has shown that LLMs' refusal behavior can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension for analyzing safety mechanisms in LLMs: harmfulness, which is encoded internally as a concept separate from refusal. There exists a harmfulness direction distinct from the refusal direction. As causal evidence, steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, whereas steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on this internal belief. These insights lead to a practical safety application: the model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decisions across diverse input instructions, offering a new perspective for studying AI safety.