🤖 AI Summary
Single-chain-of-thought (CoT) analysis fails to uncover the causal mechanisms and robustness of large language model (LLM) reasoning.
Method: We propose a resampling-based causal-analysis framework that combines sampling many CoTs from the model's distribution, robustness quantification (via step-deletion effects), causal mediation analysis, and on-policy versus off-policy CoT-editing experiments.
Contribution: We find that critical planning steps exhibit high deletion resistance and strong causal influence, whereas self-preservation statements exert negligible causal effects. We confirm that implicit hints exert cumulative effects across reasoning steps. Crucially, resampling yields more stable and reliable causal attributions than off-policy editing, improving the precision of reasoning-path attribution, the interpretability of interventions, and cross-case consistency.
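The "deletion resistance" finding can be illustrated with a minimal sketch of the resilience idea: delete a reasoning step, resample a replacement, and count how many delete-and-resample rounds it takes before similar content stops reappearing. The toy "model", the step names, and the exact-match stand-in for semantic similarity below are all illustrative assumptions, not the paper's actual setup.

```python
import random

def sample_step(prefix, rng):
    # Hypothetical model: if the critical planning step is missing from the
    # prefix, the model restates it 80% of the time; otherwise it emits filler.
    if "plan:leverage" not in prefix and rng.random() < 0.8:
        return "plan:leverage"
    return rng.choice(["recap:emails", "note:deadline"])

def resilience(deleted_step, prefix, rng, max_rounds=100):
    """Rounds of resampling until the deleted step's content no longer
    reappears (exact string match stands in for semantic similarity)."""
    for rounds in range(1, max_rounds + 1):
        if sample_step(prefix, rng) != deleted_step:
            return rounds
    return max_rounds

rng = random.Random(0)
trials = 500
plan = sum(resilience("plan:leverage", [], rng) for _ in range(trials)) / trials
filler = sum(resilience("recap:emails", ["plan:leverage"], rng)
             for _ in range(trials)) / trials
print(f"planning-step resilience: {plan:.1f} rounds")  # high: model restates it
print(f"filler-step resilience:   {filler:.1f} rounds")
```

Under these assumptions the critical planning step needs several rounds of re-deletion before it stays gone, while filler content is removed almost immediately, mirroring the summary's claim that critical steps resist deletion.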
📝 Abstract
Most work interpreting reasoning models studies only a single chain of thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood through sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have little causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to the CoT sufficient for steering reasoning? Such edits are common in the literature, yet they take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find that off-policy interventions yield small, unstable effects compared with resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may simply restate it after the edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal yet have large effects when successfully eliminated. Fourth, since CoT is sometimes "unfaithful," can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints which causally affect the output without being explicitly mentioned exert a subtle, cumulative influence on the CoT that persists even after the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
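The core resampling measurement described above can be sketched in a few lines: condition on the CoT up to a sentence, estimate the decision distribution with and without that sentence fixed, and report the shift. The toy generative "model", its step names, and its probabilities are illustrative assumptions standing in for an LLM; the paper's actual method additionally filters resamples for semantic difference, which this sketch omits.

```python
import random
from collections import Counter

def sample_next(prefix, rng):
    # Hypothetical 3-step process: plan, side note, then final decision.
    if len(prefix) == 0:
        return rng.choice(["plan:leverage", "plan:negotiate"])
    if len(prefix) == 1:
        return rng.choice(["note:self-preservation", "note:none"])
    # The decision depends strongly on the plan, not at all on the note.
    if prefix[0] == "plan:leverage":
        return "blackmail" if rng.random() < 0.9 else "comply"
    return "blackmail" if rng.random() < 0.1 else "comply"

def rollout(prefix, rng, total_steps=3):
    steps = list(prefix)
    while len(steps) < total_steps:
        steps.append(sample_next(steps, rng))
    return steps[-1]  # the final decision

def decision_dist(prefix, rng, n=2000):
    counts = Counter(rollout(prefix, rng) for _ in range(n))
    return {k: v / n for k, v in counts.items()}

def counterfactual_importance(cot, index, rng, n=2000):
    """Total-variation shift in the decision distribution when step
    `index` is resampled on-policy instead of kept fixed."""
    kept = decision_dist(cot[: index + 1], rng, n)   # step held fixed
    resampled = decision_dist(cot[:index], rng, n)   # step resampled
    keys = set(kept) | set(resampled)
    return 0.5 * sum(abs(kept.get(k, 0) - resampled.get(k, 0)) for k in keys)

rng = random.Random(0)
cot = ["plan:leverage", "note:self-preservation"]
print("planning step importance:", counterfactual_importance(cot, 0, rng))
print("self-preservation importance:", counterfactual_importance(cot, 1, rng))
```

By construction, resampling the planning step shifts the decision distribution substantially, while resampling the self-preservation note barely moves it, the same signature the abstract reports for blackmail decisions.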