Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing interpretability methods for large language models, which rely on static activations and are often confounded by superficial lexical features, thereby failing to uncover genuine reasoning mechanisms. The authors propose modeling the reasoning process as a trajectory of iterative layer-wise optimization, analyzing geometric displacements across layers rather than static snapshots to identify geometric invariants underlying reasoning. This trajectory-based perspective complements conventional probing techniques and enables activation-free interpretability analysis in both dense and mixture-of-experts (MoE) architectures. Evaluated on tasks including commonsense reasoning, question answering, and toxicity detection, the method significantly outperforms current approaches and effectively mitigates lexical confounding, demonstrating the efficacy of trajectory modeling in revealing the true structural dynamics of model reasoning.

📝 Abstract
Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading linear probes to learn surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing the displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves, using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforms conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
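The abstract's core shift, from probing a single layer's activations to probing how representations move between layers, can be illustrated with a minimal sketch. The code below is not the authors' implementation; it only shows one plausible way to turn a stack of per-layer hidden states into displacement-based trajectory features (step magnitudes and turning angles), with the function name and feature choices being assumptions for illustration.

```python
import numpy as np

def displacement_features(hidden_states):
    """Illustrative trajectory features for one example.

    hidden_states: array of shape (num_layers, hidden_dim), the hidden
    state of a token at each layer. Instead of probing the raw
    activations, we featurize the layer-to-layer displacements
    h_{l+1} - h_l, i.e. the geometry of the trajectory itself.
    """
    deltas = np.diff(hidden_states, axis=0)        # (num_layers-1, hidden_dim)
    step_norms = np.linalg.norm(deltas, axis=1)    # how far each layer moves the state
    # cosine similarity between consecutive displacement directions:
    # how sharply the trajectory turns at each layer
    turn_cos = np.sum(deltas[:-1] * deltas[1:], axis=1) / (
        step_norms[:-1] * step_norms[1:] + 1e-8
    )
    return np.concatenate([step_norms, turn_cos])

# Toy example: a random-walk "trajectory" over 6 layers of a 4-dim state.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(size=(6, 4)), axis=0)
feats = displacement_features(traj)
print(feats.shape)  # 5 step norms + 4 turn cosines -> (9,)
```

A probe (e.g. logistic regression) would then be trained on `feats` rather than on any single layer's activation vector, which is one way the abstract's "activation-free" framing could be operationalized.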
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
explainability
hidden states
reasoning
polysemantic features
Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory analysis
geometric displacement
layer-wise dynamics
LLM interpretability
reasoning invariants
Hamed Damirchi
Australian Institute for Machine Learning, Adelaide University
Ignacio Meza De la Jara
Australian Institute for Machine Learning, Adelaide University
Ehsan Abbasnejad
Assoc. Prof. Monash University
Machine learning · Responsible machine learning · Vision and Language · Machine Reasoning · Bayesian
Afshar Shamsi
Concordia University
Zhen Zhang
The University of Adelaide
Causation · Probabilistic Graphical Models · Probabilistic Inference · Graph Neural Networks
Javen Shi
Australian Institute for Machine Learning, Adelaide University