Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether large language models encode the semantic operations underlying natural language inference, rather than only the labels those operations produce. The authors construct premise–hypothesis pairs that differ by a single semantic transformation and apply three interpretability techniques across four open-weight decoder models: singular value decomposition (SVD) to estimate operation-specific subspaces, layer-wise activation analysis, and activation steering. The analysis indicates that semantic operations occupy partially distinct but overlapping subspaces. Transformation effects are decodable with 84.8%–99% accuracy, and the estimated subspaces exceed random-subspace baselines. Activation steering shows that these directions causally influence model predictions, though steerability varies across models, and cross-operation steering reveals structured interference between operations.
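
As a rough illustration of the subspace analysis summarized above, the sketch below estimates an operation-level subspace with SVD and compares it against a random subspace of the same size. Everything here is an assumption made for illustration: the activations are synthetic (no model is loaded), the dimensions `d`, `k`, `n`, the noise scale, and the helper names (`simulate_diffs`, `estimate_subspace`, `captured_energy`, `subspace_overlap`, `random_subspace`) are hypothetical, and the paper's actual estimation procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 64, 3, 200  # hypothetical hidden size, subspace rank, pairs per operation

# Synthetic ground truth: two operations whose subspaces share one direction.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
basis_a, basis_b = q[:, 0:3].T, q[:, 2:5].T  # each of shape (k, d)


def simulate_diffs(basis, n_samples):
    """Stand-in for (hypothesis - premise) activation differences at one layer."""
    coeffs = rng.normal(size=(n_samples, basis.shape[0]))
    noise = 0.1 * rng.normal(size=(n_samples, basis.shape[1]))
    return coeffs @ basis + noise


def estimate_subspace(diffs, rank):
    """Top right singular vectors of the (uncentered) difference matrix."""
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:rank]  # orthonormal rows, shape (rank, d)


def captured_energy(diffs, subspace):
    """Fraction of squared norm that survives projection onto the subspace."""
    projected = diffs @ subspace.T
    return float((projected ** 2).sum() / (diffs ** 2).sum())


def subspace_overlap(u, v):
    """Mean squared cosine of the principal angles between two subspaces."""
    return float(np.linalg.norm(u @ v.T) ** 2 / u.shape[0])


def random_subspace(dim, rank):
    """Orthonormal basis of a random subspace, used as a baseline."""
    basis, _ = np.linalg.qr(rng.normal(size=(dim, rank)))
    return basis.T


train_a, test_a = simulate_diffs(basis_a, n), simulate_diffs(basis_a, n)
sub_a = estimate_subspace(train_a, k)
sub_b = estimate_subspace(simulate_diffs(basis_b, n), k)

learned = captured_energy(test_a, sub_a)                  # held-out energy, learned subspace
baseline = captured_energy(test_a, random_subspace(d, k)) # same rank, random directions
overlap = subspace_overlap(sub_a, sub_b)                  # 0 = disjoint, 1 = identical

print(f"learned={learned:.2f} baseline={baseline:.2f} overlap={overlap:.2f}")
```

In this toy setup the learned subspace should capture far more held-out energy than the random one, and the overlap score should land between 0 and 1 because the two synthetic operations share one of three directions. With real models, `simulate_diffs` would be replaced by activations extracted at a chosen layer.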

📝 Abstract

Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with $84.8\%$–$99\%$ accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.
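
The steering test described in the abstract can be pictured with a toy linear readout. The sketch below is only an illustration under stated assumptions: the hidden state, the three-label readout matrix, the steering strength `alpha`, and the helpers `steer` and `margin` are all hypothetical, and the steering direction is deliberately constructed to align with the readout. In a real decoder model the same addition would be applied to a layer's activations during the forward pass, and the effect on predictions would have to be measured empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # hypothetical hidden size

# Toy linear readout over three NLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
readout = rng.normal(size=(3, d))
hidden = rng.normal(size=d)


def steer(state, direction, alpha):
    """Add a unit-normalized steering direction, scaled by alpha, to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return state + alpha * unit


def margin(state):
    """Contradiction logit minus entailment logit under the toy readout."""
    logits = readout @ state
    return float(logits[2] - logits[0])


# Assumed steering direction: the one that separates contradiction from entailment.
direction = readout[2] - readout[0]

# Control direction: a random vector with its component along `direction` removed.
control = rng.normal(size=d)
control -= (control @ direction) / (direction @ direction) * direction

alpha = 4.0
base = margin(hidden)
steered = margin(steer(hidden, direction, alpha))
controlled = margin(steer(hidden, control, alpha))

print(f"base={base:.2f} steered={steered:.2f} control={controlled:.2f}")
```

Steering along the aligned direction raises the margin by `alpha` times the norm of `direction`, while the orthogonal control leaves it unchanged. Cross-operation interference of the kind the abstract reports would correspond to a direction estimated for one operation having a nonzero effect on another operation's readout.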

Problem

Research questions and friction points this paper is trying to address.

semantic operations

natural language inference

mechanistic interpretation

transformer representations

causal relevance

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic operations

activation steering

mechanistic interpretability