AI Summary
This study investigates the intrinsic mechanisms underlying harmful behaviors in large language models (LLMs) and proposes a safety regulation method grounded in representation-space analysis. To address the problem of opaque, emergent harm, the method employs linear probing and activation-space disentanglement to identify 55 fine-grained harmfulness subconcepts, each corresponding to a linear direction in the hidden-layer representation space; these directions are found to cluster within a low-rank subspace. Building on this, the authors construct an interpretable harmfulness subspace and introduce a dual-strategy intervention: subspace suppression and dominant-direction steering. Evaluated across diverse benchmarks, the approach reduces harmful output rates by 92.3% while preserving model utility: average utility loss remains below 0.5%. The method demonstrates strong cross-model transferability and experimental reproducibility. An open-source toolkit is released, establishing a scalable analytical paradigm and practical technical pathway for LLM safety auditing, controllable generation, and alignment optimization.
Abstract
Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation along the subspace's dominant direction. We find that dominant-direction steering achieves near elimination of harmfulness with only a small loss of utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
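The two interventions named in the abstract are standard linear operations on hidden activations, and can be sketched as follows. This is a minimal illustration under assumed shapes, not the paper's released toolkit: `V` stands for an orthonormal basis of the harmfulness subspace, `v` for its dominant direction, and the function names are invented for this example.

```python
import numpy as np

def ablate_subspace(h, V):
    """Subspace suppression: remove the component of activation h that lies
    in the subspace spanned by the rows of V (V is (k, d), rows orthonormal)."""
    return h - V.T @ (V @ h)

def steer(h, v, alpha):
    """Dominant-direction steering: shift h by -alpha along the unit
    direction v to suppress harmfulness (positive alpha would amplify it)."""
    return h - alpha * v

# Stand-in subspace: orthonormal basis from QR of a random matrix.
rng = np.random.default_rng(1)
d, k = 512, 8
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
V = Q.T          # (k, d) orthonormal rows
v = V[0]         # treat the first basis vector as the dominant direction

h = rng.normal(size=d)
h_ablated = ablate_subspace(h, V)
h_steered = steer(h, v, alpha=2.0)
# After ablation, h has no remaining component in the subspace.
print(np.abs(V @ h_ablated).max())
```

Ablation zeroes the activation's projection onto the whole subspace, while steering only translates it along one direction; the abstract's finding is that the gentler, one-dimensional steering suffices to nearly eliminate harmful outputs at lower utility cost.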