🤖 AI Summary
This study investigates whether chain-of-thought (CoT) reasoning traces stay synchronized with the computation that actually determines a model’s answer, and therefore whether they are reliable as a supervisory and auditing mechanism. To this end, the authors propose a step-level Detect-Classify-Compare framework built around an answer-commitment proxy, which is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation, and complemented by paired truncation and donor-corruption tests. Across nine models and seven reasoning benchmarks, latent answer commitment and explicit answer arrival align on only 61.9% of steps on average. In 58.0% of detected mismatch events, the model keeps producing deliberative-looking text after the commitment proxy has already stabilized, a pattern termed “confabulated continuation.” Notably, lower step-level alignment is associated with larger CoT utility, so the settings that benefit most from CoT are often the least temporally faithful. The work provides systematic evidence of a disconnect between CoT traces and the point at which the answer is actually formed, challenging the assumption that CoT serves as a faithful reasoning log.
📝 Abstract
Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect-Classify-Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.
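To make the step-level alignment figure concrete, here is a minimal Python sketch of how such a rate could be computed. It is not the authors' implementation. It assumes each reasoning step has already been reduced to two boolean flags: whether the commitment proxy has stabilized on the final answer, and whether the trace has explicitly stated that answer. The function name `step_alignment` and the bucket labels `post_commitment` and `explicit_without_commitment` are hypothetical, and how they map onto the paper's own mismatch categories is an assumption.

```python
from collections import Counter


def step_alignment(latent_committed, explicit_arrived):
    """Return (alignment_rate, counts) for one reasoning trace.

    latent_committed[i]: True if the (assumed) commitment proxy has
        stabilized on the final answer by step i.
    explicit_arrived[i]: True if the visible trace has stated the
        final answer by step i.
    """
    if len(latent_committed) != len(explicit_arrived):
        raise ValueError("both flag sequences must have the same length")
    if not latent_committed:
        raise ValueError("at least one step is required")

    counts = Counter()
    for latent, explicit in zip(latent_committed, explicit_arrived):
        if latent == explicit:
            # Trace and proxy agree about whether the answer exists yet.
            counts["aligned"] += 1
        elif latent:
            # Proxy has committed, but the text is still deliberating.
            counts["post_commitment"] += 1
        else:
            # Text states an answer the proxy has not committed to.
            counts["explicit_without_commitment"] += 1

    rate = counts["aligned"] / len(latent_committed)
    return rate, dict(counts)


# Toy trace: the proxy commits at step 2, the text states the answer at step 4.
rate, counts = step_alignment(
    [False, True, True, True, True],
    [False, False, False, True, True],
)
print(rate, counts)
```

In this toy trace, steps 2 and 3 fall into the post-commitment bucket: the proxy has settled while the visible text is still deliberating, which is the situation the abstract describes as confabulated continuation.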