🤖 AI Summary
This work addresses a limitation in current large language model (LLM) reasoning research: it relies heavily on surface-level chain-of-thought (CoT) traces while neglecting latent-state trajectories, which can bias interpretability analyses, evaluations, and intervention strategies. The paper reframes LLM reasoning as the formation of latent-state trajectories and formalizes three competing hypotheses about what mediates reasoning: latent-state trajectories (H1), explicit surface CoT (H2), and generic serial compute (H0). Drawing on a reorganization of recent empirical, mechanistic, and survey work, together with compute-audited worked exemplars that separate surface traces, latent interventions, and matched budget expansions, it argues that current evidence most strongly supports H1 as a default working hypothesis, not a task-independent verdict, and recommends evaluation designs that disentangle the three factors.
📝 Abstract
This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.
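As a rough illustration of what "explicitly disentangling" the three factors could look like, the sketch below computes main effects from a full 2x2x2 factorial design. It is not the paper's protocol: the factor names, the `main_effects` helper, and the idea of summarizing each factor by a difference of mean accuracies are all assumptions made for this example.

```python
from itertools import product
from statistics import mean

# Hypothetical factor names; each is coded 0 (withheld) or 1 (available).
FACTORS = ("surface_trace", "latent_state", "serial_compute")


def main_effects(accuracy):
    """Return the main effect of each factor in a full 2x2x2 design.

    `accuracy` maps a (surface_trace, latent_state, serial_compute) tuple
    of 0/1 flags to the task accuracy measured under that condition.
    A factor's main effect is the mean accuracy over conditions where it
    is 1 minus the mean accuracy over conditions where it is 0.
    """
    cells = list(product((0, 1), repeat=len(FACTORS)))
    missing = [c for c in cells if c not in accuracy]
    if missing:
        raise ValueError(f"missing conditions: {missing}")

    effects = {}
    for i, name in enumerate(FACTORS):
        on = mean(accuracy[c] for c in cells if c[i] == 1)
        off = mean(accuracy[c] for c in cells if c[i] == 0)
        effects[name] = on - off
    return effects
```

Under this simplified reading, a dominant `latent_state` effect would be loosely in line with H1, a dominant `surface_trace` effect with H2, and a dominant `serial_compute` effect with H0. A real study would also need interaction terms, uncertainty estimates, and a careful definition of what counts as a matched compute budget.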