AI Summary
This study investigates the intrinsic mechanisms underlying harmful behaviors in large language models (LLMs) and proposes a safety regulation method grounded in representation-space analysis. To address the problem of opaque, emergent harm, the method employs linear probing and activation-space disentanglement to identify 55 fine-grained harmfulness subconcepts, each corresponding to a linear direction in the hidden-layer representation space; these directions are found to cluster within a low-rank subspace. Building on this, the authors construct an interpretable harmfulness subspace and introduce a dual-strategy intervention: subspace suppression and dominant-direction steering. Evaluated across diverse benchmarks, the approach reduces harmful output rates by 92.3% while preserving model utility: average utility loss remains below 0.5%. The method demonstrates strong cross-model transferability and experimental reproducibility. An open-source toolkit is released, establishing a scalable analytical paradigm and practical technical pathway for LLM safety auditing, controllable generation, and alignment optimization.
Abstract
Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation along the subspace's dominant direction. We find that dominant-direction steering achieves near elimination of harmfulness with only a small loss of utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
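The two interventions named in the abstract are standard linear operations on hidden activations, and can be sketched as follows. This is a minimal illustration under assumed shapes, not the paper's released toolkit: `V` stands for an orthonormal basis of the harmfulness subspace, `v` for its dominant direction, and the function names are invented for this example.

```python
import numpy as np

def ablate_subspace(h, V):
    """Subspace suppression: remove the component of activation h that lies
    in the subspace spanned by the rows of V (V is (k, d), rows orthonormal)."""
    return h - V.T @ (V @ h)

def steer(h, v, alpha):
    """Dominant-direction steering: shift h by -alpha along the unit
    direction v to suppress harmfulness (positive alpha would amplify it)."""
    return h - alpha * v

# Stand-in subspace: orthonormal basis from QR of a random matrix.
rng = np.random.default_rng(1)
d, k = 512, 8
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
V = Q.T          # (k, d) orthonormal rows
v = V[0]         # treat the first basis vector as the dominant direction

h = rng.normal(size=d)
h_ablated = ablate_subspace(h, V)
h_steered = steer(h, v, alpha=2.0)
# After ablation, h has no remaining component in the subspace.
print(np.abs(V @ h_ablated).max())
```

Ablation zeroes the activation's projection onto the whole subspace, while steering only translates it along one direction; the abstract's finding is that the gentler, one-dimensional steering suffices to nearly eliminate harmful outputs at lower utility cost.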