🤖 AI Summary
This study investigates how large language models encode semantic relations—such as synonymy, antonymy, and hypernymy—by systematically analyzing where semantic representations are located and which features contribute to them across models of varying scale. Leveraging interpretability methods including linear probing, sparse autoencoders (SAEs), and activation patching, the work reveals a directional asymmetry in the encoding of hierarchical semantic relationships. Semantic signals are found to be strongest in the middle layers and to propagate primarily through the MLP pathways. Notably, in Llama 3.1, SAEs enable reliable interventions on semantic representations, whereas smaller models exhibit weak and inconsistent effects. The research establishes a reproducible framework linking sparse features to causal evidence from probing, offering a novel pathway toward understanding the mechanistic underpinnings of semantic representation in language models.
📝 Abstract
Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent relationships between concepts. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAEs) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Relation difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.
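To make the linear-probing setup concrete, the sketch below trains a simple multinomial logistic-regression probe to classify a relation label from hidden-state vectors. This is a minimal illustration under assumed conditions, not the paper's actual pipeline: the activations, dimensions, and relation labels here are synthetic placeholders standing in for per-layer hidden states extracted from a model.

```python
import numpy as np

# Minimal sketch of a linear relation probe. Assumes hidden states arrive as a
# (num_pairs, hidden_dim) array, each word pair labeled with one of four
# relations (synonymy, antonymy, hypernymy, hyponymy). All data below is
# synthetic and only illustrates the probing technique.
rng = np.random.default_rng(0)
num_pairs, hidden_dim, num_relations = 400, 64, 4

# Synthetic "activations": each relation gets a distinct mean direction so the
# probe has linearly decodable signal to find.
means = rng.normal(size=(num_relations, hidden_dim))
labels = rng.integers(0, num_relations, size=num_pairs)
X = means[labels] + 0.5 * rng.normal(size=(num_pairs, hidden_dim))

# One-vs-all softmax regression trained by plain gradient descent.
W = np.zeros((hidden_dim, num_relations))
b = np.zeros(num_relations)
Y = np.eye(num_relations)[labels]  # one-hot targets
for _ in range(300):
    logits = X @ W + b
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - Y) / num_pairs  # gradient of mean cross-entropy
    W -= X.T @ grad
    b -= grad.sum(axis=0)

accuracy = (np.argmax(X @ W + b, axis=1) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

In the paper's framework, a probe like this would be fit separately on activations from each layer and pathway (residual stream, MLP, attention), so that accuracy as a function of layer traces out the "stable profiles" described in the abstract.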