🤖 AI Summary
This paper addresses the unsupervised identification of linear concept subspaces in language models, proposing a geometric causal probing framework grounded in intrinsic geometry and causal criteria, eliminating reliance on auxiliary classification tasks. Methodologically, it formalizes the ideal concept subspace via information-theoretic principles, theoretically establishes that LEACE-identified subspaces admit intervenable causal semantic structure, and designs a concept manipulation mechanism within the generative process. Key contributions include: (1) an intrinsic, annotation-free criterion for identifying concept subspaces; (2) empirical evidence that the one-dimensional subspace extracted by LEACE encodes roughly half of the total verbal-number concept information; and (3) precise causal intervention and controlled generation over concept values such as the grammatical number of the generated word. The framework bridges geometric representation learning with causal interpretability, enabling direct, task-agnostic manipulation of linguistic concepts in pretrained language models.
📝 Abstract
The linear subspace hypothesis (Bolukbasi et al., 2016) states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. Prior work has relied on auxiliary classification tasks to identify and evaluate candidate subspaces that might support this hypothesis. We instead give a set of intrinsic criteria that characterize an ideal linear concept subspace and enable us to identify the subspace using only the language model distribution. Our information-theoretic framework accounts for spuriously correlated features in the representation space (Kumar et al., 2022). As a byproduct of this analysis, we hypothesize a causal process for how a language model might leverage concepts during generation. Empirically, we find that LEACE (Belrose et al., 2023) returns a one-dimensional subspace containing roughly half of the total concept information for verbal number under our framework. Our causal intervention for controlled generation shows that, for at least one concept, the subspace returned by LEACE can be used to manipulate the concept value of the generated word with precision.
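To make the erase-then-intervene idea concrete, here is a minimal NumPy sketch. It is not the actual LEACE algorithm from Belrose et al. (2023): for illustration it estimates a one-dimensional concept direction from binary class-mean differences on synthetic data, erases it by orthogonal projection, and then "intervenes" by setting the coordinate along that direction to a chosen value. The variable names (`concept_dir`, `alpha`) and the synthetic data-generating process are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Ground-truth concept direction used only to generate synthetic data.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

# Synthetic representations: a binary concept value z in {0, 1}
# shifts activations along concept_dir; the rest is Gaussian noise.
z = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, d)) + np.outer(2.0 * z - 1.0, concept_dir)

# Estimate a 1-D concept subspace from the class-mean difference
# (a simplification of the least-squares eraser used by LEACE).
u = X[z == 1].mean(0) - X[z == 0].mean(0)
u /= np.linalg.norm(u)

# Erasure: project out the component along u from every representation.
X_erased = X - np.outer(X @ u, u)

# After erasure, no linear signal about z remains along u:
gap = X_erased[z == 1].mean(0) - X_erased[z == 0].mean(0)
print(abs(float(gap @ u)) < 1e-8)  # prints True

# Intervention for controlled generation: set the concept coordinate
# to a target value alpha, pushing every point to the z = 1 side.
alpha = 1.0
X_set_to_one = X_erased + alpha * u
```

In the paper's setting the intervention is applied to hidden states during generation; the sketch only shows the linear-algebraic core of projecting out, then re-inserting, a concept coordinate.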