🤖 AI Summary
This study investigates how Meta-Llama-3-8B (base), after performing cross-token information routing, can produce the result of three-digit addition relying solely on the final input token. Through causal residual patching and cumulative attention ablation, the authors identify a post-routing boundary around layer 17, beyond which the addition output is governed by the last token and by structured numerical directions residing in a low-rank subspace. The work presents the first evidence that numerical directions across different carry contexts are related via approximately orthogonal rotational mappings, enabling precise counterfactual digit editing through rotation-based interventions. Experiments demonstrate that late self-attention layers are largely dispensable, and that cross-context editing, which fails under naive transfer, can be restored via low-rank Procrustes alignment and directional manipulation. Negative control experiments fail to reproduce this effect, supporting the specificity of the identified mechanism.
📝 Abstract
We study three-digit addition in Meta-Llama-3-8B (base) under a one-token readout to characterize how
arithmetic answers are finalized after cross-token routing becomes causally irrelevant.
Causal residual patching and cumulative attention ablations localize a sharp boundary near layer 17:
beyond it, the decoded sum is controlled almost entirely by the last input token and late-layer self-attention
is largely dispensable.
In this post-routing regime, digit(-sum) direction dictionaries vary with a next-higher-digit context but are
well-related by an approximately orthogonal map inside a shared low-rank subspace (low-rank Procrustes alignment).
Causal digit editing matches this geometry: naive cross-context transfer fails, while rotating directions through the
learned map restores strict counterfactual edits; negative controls do not recover them.
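
To make the alignment step concrete, the following is a minimal sketch of low-rank orthogonal Procrustes alignment on synthetic data. It is not the authors' implementation: the abstract does not say how the shared subspace is estimated, so taking the top right singular vectors of the stacked dictionaries is an assumption, and every name here (`shared_subspace`, `procrustes_rotation`, `fit_low_rank_map`, `transfer`, `make_toy`) is hypothetical. The toy dictionaries stand in for per-context digit directions and are random vectors, not model activations.

```python
import numpy as np

def shared_subspace(A, B, rank):
    """Orthonormal basis (d x rank) spanning the top-`rank` right singular
    subspace of the two stacked dictionaries (assumed construction)."""
    _, _, Vt = np.linalg.svd(np.vstack([A, B]), full_matrices=False)
    return Vt[:rank].T

def procrustes_rotation(X, Y):
    """Orthogonal R minimizing ||X @ R - Y||_F (classical orthogonal Procrustes)."""
    U, _, Wt = np.linalg.svd(X.T @ Y)
    return U @ Wt

def fit_low_rank_map(A, B, rank):
    """Fit the subspace basis V and the rotation R that carries context-A
    coordinates onto context-B coordinates inside that subspace."""
    V = shared_subspace(A, B, rank)
    R = procrustes_rotation(A @ V, B @ V)
    return V, R

def transfer(directions, V, R):
    """Project into the subspace, rotate, and lift back to the full space."""
    return directions @ V @ R @ V.T

def make_toy(seed=0, n_digits=10, d=64, rank=4):
    """Synthetic dictionaries: the same digit coordinates in two contexts,
    differing by a hidden rotation inside a shared rank-`rank` subspace."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, rank)))          # subspace basis
    R_true, _ = np.linalg.qr(rng.standard_normal((rank, rank)))  # hidden rotation
    coords = rng.standard_normal((n_digits, rank))
    A = coords @ Q.T            # directions in context A
    B = coords @ R_true @ Q.T   # directions in context B
    return A, B

A, B = make_toy()
V, R = fit_low_rank_map(A, B, rank=4)

naive_err = np.linalg.norm(A - B)                    # reuse A's directions as-is
rotated_err = np.linalg.norm(transfer(A, V, R) - B)  # rotate through the fitted map
print(f"naive transfer error:   {naive_err:.3e}")
print(f"rotated transfer error: {rotated_err:.3e}")
```

In this toy setting the two dictionaries differ by an exact rotation, so the fitted map should reproduce the second dictionary up to numerical error, while reusing the first dictionary unchanged should not. With real activations the relation is only approximately orthogonal, so a nonzero residual after rotation would be expected.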