The Geometry of Harmfulness in LLMs through Subconcept Probing

๐Ÿ“… 2025-07-23
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This study investigates the intrinsic mechanisms underlying harmful behaviors in large language models (LLMs) and proposes a safety regulation method grounded in representation-space analysis. To address the problem of opaque, emergent harm, the method employs linear probing and activation-space disentanglement to identify 55 fine-grained harmfulness subconcepts, each corresponding to a linear direction in the hidden-layer representation space; these directions are found to cluster within a low-rank subspace. Building on this, the authors construct an interpretable harmfulness subspace and introduce a dual-strategy intervention: subspace suppression and dominant-direction steering. Evaluated across diverse benchmarks, the approach reduces harmful output rates by 92.3% while preserving model utilityโ€”average utility loss remains below 0.5%. The method demonstrates strong cross-model transferability and experimental reproducibility. An open-source toolkit is released, establishing a scalable analytical paradigm and practical technical pathway for LLM safety auditing, controllable generation, and alignment optimization.

Technology Category

Application Category

๐Ÿ“ Abstract
Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
Problem

Research questions and friction points this paper is trying to address.

Understanding and curbing harmful behaviors in LLMs
Probing and steering harmful content in model internals
Auditing and hardening future generations of language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multidimensional framework probes harmful content
55 interpretable directions span harmfulness subspace
Dominant direction steering nearly eliminates harmfulness
๐Ÿ”Ž Similar Papers
No similar papers found.