🤖 AI Summary
This study investigates whether chain-of-thought (CoT) reasoning genuinely encodes task-relevant information and whether correct answers can still be recovered from erroneous CoT sequences. Using activation patching on GSM8K, the authors transfer token-level hidden states from a CoT generation into a direct-answer run for the same question and measure the causal effect on final-answer accuracy. They provide causal evidence that, even when the CoT trace is incorrect, the hidden state of a single token can suffice to recover the correct answer. Task-relevant information is unevenly distributed across token types and network layers, so a complete reasoning chain is not always necessary. In the experiments, generation after patching is substantially more accurate than both direct-answer prompting and the original CoT trace; language tokens such as verbs and entities prove more informative than mathematical tokens, and patched outputs are often shorter than a full CoT trace while still exceeding its accuracy.
📝 Abstract
Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, revealing that individual CoT tokens can encode sufficient information to recover the correct answer, even when the original trace is incorrect. This task-relevant information is more prevalent in correct than in incorrect CoT runs and is unevenly distributed across tokens, concentrating in mid-to-late layers and appearing earlier in the reasoning trace. Moreover, language tokens such as verbs and entities carry task-solving information that steers generation toward correct reasoning, whereas mathematical tokens encode answer-proximal content that rarely leads to a correct answer when patched. Patched outputs are often shorter and yet exceed the accuracy of a full CoT trace, suggesting that complete reasoning chains are not always necessary. Together, these findings demonstrate that CoT encodes recoverable, token-level problem-solving information, offering new insight into how reasoning is represented and where it breaks down.
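To make the patching operation concrete, here is a minimal sketch of token-level activation patching on a toy model. It is not the authors' code or setup: the toy network, the token ids, the readout, and the names `embed`, `layer`, and `run` are all assumptions made for illustration, and the choice to patch into the last position of the direct-answer run is likewise an assumption. The sketch only shows the mechanics: cache hidden states from a source (CoT) run, overwrite one hidden state in a target (direct-answer) run, and compare the outputs.

```python
import math

def embed(token_id, dim=4):
    # Deterministic toy embedding; stands in for a learned embedding table.
    return [math.sin(token_id * (i + 1)) for i in range(dim)]

def layer(hidden, scale):
    # Toy causal layer: each position mixes its own state with the mean of
    # all states up to and including itself, then applies tanh.
    out = []
    for t in range(len(hidden)):
        prefix = hidden[: t + 1]
        mean = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([math.tanh(scale * (h + m)) for h, m in zip(hidden[t], mean)])
    return out

def run(tokens, n_layers=3, patch=None):
    """Forward pass over `tokens`.

    `patch` is an optional (layer_idx, position, vector) triple that
    overwrites one hidden state right after that layer is computed, so
    every later layer reads the patched value.
    Returns (readout, cache), where cache[l][t] is the hidden state of
    position t after layer l.
    """
    hidden = [embed(tok) for tok in tokens]
    cache = []
    for l in range(n_layers):
        hidden = layer(hidden, scale=0.5 + 0.25 * l)
        if patch is not None and patch[0] == l:
            hidden[patch[1]] = list(patch[2])
        cache.append([list(h) for h in hidden])
    readout = sum(hidden[-1])  # toy scalar "answer" read from the last position
    return readout, cache

# Hypothetical token ids: a question followed by a CoT trace, and the same
# question followed by a direct-answer prompt.
cot_tokens = [3, 7, 1, 9, 4]
direct_tokens = [3, 7, 5]

cot_out, cot_cache = run(cot_tokens)
direct_out, direct_cache = run(direct_tokens)

# Transfer one CoT token's hidden state (layer 1, position 3) into the last
# position of the direct-answer run, then let the remaining layer run.
last = len(direct_tokens) - 1
patched_out, _ = run(direct_tokens, patch=(1, last, cot_cache[1][3]))

# Sanity checks: patching a run with its own activation changes nothing,
# and patching the final layer's last position reproduces the source readout.
noop_out, _ = run(direct_tokens, patch=(1, last, direct_cache[1][last]))
final_out, _ = run(direct_tokens, patch=(2, last, cot_cache[2][-1]))
```

With a real transformer, the same pattern would typically be implemented with forward hooks that cache and overwrite residual-stream activations, and the comparison would be final-answer accuracy over a dataset rather than a scalar readout. Sweeping the source layer and token position is what would expose where the recoverable information sits.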