When and How Long? The Readout-Mediator Angle in Temporal Reasoning

📅 2026-05-27

🤖 AI Summary

This work shows that linear probes used in language model interpretability can learn directions that are unaligned with the model’s actual computation, which makes probe-based interpretations of behavior misleading. The study introduces the “readout-mediator angle,” the angle between a probe’s subspace and the causal subspace found by Distributed Alignment Search (DAS), and measures it on calendar-date duration reasoning tasks. The angle is indistinguishable from the angle between two random subspaces, so the probe direction is effectively orthogonal to the pathway the model actually computes with. Combining DAS with attention-head analysis, MLP functional decomposition, and sparse autoencoders, the authors replicate the dissociation across four model scales and two model families, and report preliminary evidence that it extends to spatial displacement and symbolic arithmetic tasks. These findings call into question the reliability of linear probes as runtime safety monitors.
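The geometric comparison in the summary can be made concrete with principal angles between two subspaces. The sketch below is only an illustration, not the authors' code: the function names, the choice of the smallest principal angle as the summary statistic, and the dimensions (a 2-d probe subspace, a 4-d causal subspace, a 512-d hidden state) are all assumptions. It also samples a random-subspace baseline by orthonormalising Gaussian matrices, whose column spans are uniformly (Haar) distributed.

```python
import numpy as np


def orthonormal_basis(vectors):
    """Orthonormal basis (as columns) for the column span of `vectors`."""
    q, _ = np.linalg.qr(vectors)
    return q


def principal_angles(a, b):
    """Principal angles in radians, ascending, between span(a) and span(b)."""
    qa, qb = orthonormal_basis(a), orthonormal_basis(b)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))


def haar_null_min_angles(dim, k_a, k_b, n_samples, rng):
    """Smallest principal angle between pairs of random subspaces.

    The span of a Gaussian matrix is uniformly distributed over subspaces,
    so this samples the random-subspace baseline.
    """
    out = np.empty(n_samples)
    for i in range(n_samples):
        a = rng.standard_normal((dim, k_a))
        b = rng.standard_normal((dim, k_b))
        out[i] = principal_angles(a, b)[0]
    return out


# Hypothetical stand-ins for a probe subspace and a DAS subspace.
rng = np.random.default_rng(0)
probe_subspace = rng.standard_normal((512, 2))
das_subspace = rng.standard_normal((512, 4))

observed = principal_angles(probe_subspace, das_subspace)[0]
null = haar_null_min_angles(512, 2, 4, 200, rng)
# Fraction of null samples at or below the observed angle.
print(np.degrees(observed), np.mean(null <= observed))
```

In high dimensions the null angles concentrate near 90 degrees, so an observed angle that sits inside the null distribution means the two subspaces are no more aligned than chance.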

📝 Abstract

A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the \emph{readout-mediator angle} -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\pm}30$ and ${\pm}61$ days, and MLPs then convert \emph{when} (absolute date) into \emph{how long} (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$--$9\,$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.
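The contrast between decoding and ablation described in the abstract can be reproduced on toy data. The sketch below is a constructed example under loud assumptions, not the paper's experiment: the activations are synthetic, the "readout" plane and 4-d "mediator" subspace are planted by hand, and `toy_model_output` is a hypothetical stand-in for downstream computation that reads only the mediator. It shows the measurement procedure (fit a sin/cos probe by least squares, project a subspace out, compare outputs), and the dissociation holds here by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_days = 64, 365
theta = 2 * np.pi * np.arange(n_days) / n_days  # day-of-year as an angle

# Planted, mutually orthogonal subspaces (assumption: toy geometry).
basis, _ = np.linalg.qr(rng.standard_normal((dim, 6)))
readout, mediator = basis[:, :2], basis[:, 2:]

# The readout plane carries sin/cos of the date; the mediator carries
# higher harmonics that the toy downstream computation depends on.
readout_code = np.stack([np.sin(theta), np.cos(theta)], axis=1)
mediator_code = np.stack(
    [np.sin(2 * theta), np.cos(2 * theta), np.sin(3 * theta), np.cos(3 * theta)],
    axis=1,
)
acts = readout_code @ readout.T + mediator_code @ mediator.T
acts += 0.01 * rng.standard_normal(acts.shape)  # small isotropic noise


def toy_model_output(x):
    """Hypothetical downstream computation: reads only the mediator subspace."""
    return x @ mediator


def fit_sincos_probe(x, angles):
    """Least-squares linear probe onto (sin, cos) targets; shape (dim, 2)."""
    targets = np.stack([np.sin(angles), np.cos(angles)], axis=1)
    weights, *_ = np.linalg.lstsq(x, targets, rcond=None)
    return weights


def ablate(x, directions):
    """Project activations onto the orthogonal complement of span(directions)."""
    q, _ = np.linalg.qr(directions)
    return x - (x @ q) @ q.T


def relative_change(new, old):
    return np.linalg.norm(new - old) / np.linalg.norm(old)


probe = fit_sincos_probe(acts, theta)
targets = np.stack([np.sin(theta), np.cos(theta)], axis=1)
probe_r2 = 1 - np.sum((acts @ probe - targets) ** 2) / np.sum(
    (targets - targets.mean(axis=0)) ** 2
)

baseline = toy_model_output(acts)
change_probe = relative_change(toy_model_output(ablate(acts, probe)), baseline)
change_mediator = relative_change(toy_model_output(ablate(acts, mediator)), baseline)
print(probe_r2, change_probe, change_mediator)
```

On a real model, `toy_model_output` would be replaced by a forward pass from the patched layer and the mediator would come from a DAS fit rather than being planted; the projection step itself stays the same.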

Problem

Research questions and friction points this paper is trying to address.

probing

interpretability

readout-mediator angle

causal representation

language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

readout-mediator angle

probing interpretability

causal representation