🤖 AI Summary
This study investigates how large language models recall relational knowledge during text generation, focusing on which internal representations are amenable to linear probing for relation classification and why certain relations are more linearly decodable than others. We systematically evaluate latent representations derived from attention heads and MLP outputs, combining linear probes with feature attribution, attention decomposition, and residual stream tracing, and identify the per-head contribution of attention heads to the residual stream as a comparatively strong signal for relation classification. Our analysis shows that probe accuracy correlates clearly with relation specificity, entity connectedness, and how widely the relational signal is distributed across attention heads. The work identifies which representation sources are best suited to probing among those evaluated, links relational properties to probing efficacy in an interpretable way, and shows how token-level feature attribution can be used for more detailed analysis of probe behavior.
📝 Abstract
We study how large language models recall relational knowledge during text generation, with a focus on identifying latent representations suitable for relation classification via linear probes. Prior work shows how attention heads and MLPs interact to resolve subject, predicate, and object, but it remains unclear which representations support faithful linear relation classification and why some relation types are easier to capture linearly than others. We systematically evaluate different latent representations derived from attention head and MLP contributions, showing that per-head attention contributions to the residual stream are comparatively strong features for linear relation classification. Feature attribution analyses of the trained probes, together with characteristics of the different relation types, reveal clear correlations between probe accuracy and relation specificity, entity connectedness, and how widely the signal the probe relies on is distributed across attention heads. Finally, we show how token-level feature attribution of probe predictions can be used to examine probe behavior in further detail.
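To make the probing setup concrete, the following is a minimal sketch of a linear relation probe over concatenated per-head features, followed by a simple per-head attribution. It is not the authors' implementation. The data is synthetic (random vectors standing in for per-head contributions to the residual stream, with the relation signal planted in one head), and the helper names `make_toy_data`, `train_linear_probe`, and `head_attribution` are hypothetical. The abstract does not say which attribution method is used, so input-times-weight is assumed here as one simple choice for a linear probe.

```python
import numpy as np


def make_toy_data(n=600, n_heads=4, d_head=8, n_classes=3, seed=0):
    """Synthetic stand-in for per-head contributions (assumption, not real model data).

    Every head carries unit Gaussian noise; only head 1 additionally carries a
    class-specific direction, so a probe should rely mostly on that head.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n)
    X = rng.normal(size=(n, n_heads, d_head))
    class_dirs = rng.normal(size=(n_classes, d_head))
    X[:, 1, :] += 3.0 * class_dirs[y]
    # Concatenate the heads into one feature vector per example.
    return X.reshape(n, n_heads * d_head), y


def train_linear_probe(X, y, n_classes, lr=0.02, epochs=300, l2=1e-3):
    """Multinomial logistic regression trained with full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G + l2 * W)
        b -= lr * G.sum(axis=0)
    return W, b


def head_attribution(x, W, cls, n_heads, d_head):
    """Input-times-weight contribution to the logit of `cls`, summed per head."""
    contrib = x * W[:, cls]
    return contrib.reshape(n_heads, d_head).sum(axis=1)


if __name__ == "__main__":
    X, y = make_toy_data()
    W, b = train_linear_probe(X[:400], y[:400], n_classes=3)
    pred = (X[400:] @ W + b).argmax(axis=1)
    print("held-out accuracy:", (pred == y[400:]).mean())
    attr = np.mean(
        [head_attribution(x, W, c, 4, 8) for x, c in zip(X[400:], pred)], axis=0
    )
    print("mean attribution per head:", attr)
```

In this toy setting the attribution mass should concentrate on the head that carries the planted signal. With real model activations, the same per-head sums would give one way to measure how widely a probe's signal is distributed across attention heads.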