🤖 AI Summary
Chain-of-thought (CoT) reasoning often suffers from low faithfulness, failing to accurately reflect the true decision-making rationale of large language models and thereby limiting their interpretability and reliability. This work proposes Counterfactual Simulation Training (CST), the first approach to integrate counterfactual simulation into CoT training: it generates counterfactual inputs based on prompts, evaluates the predictive accuracy of CoT explanations using a simulator, and optimizes the process through LLM-based rewriting and reinforcement learning. CST effectively mitigates reliance on spurious features, reward hacking, and sycophantic behaviors, significantly enhancing reasoning faithfulness and generalization. Evaluated on models up to 235B parameters, CST improves cue-based counterfactual monitoring accuracy by 35 percentage points and boosts general counterfactual simulatability by 2 percentage points, outperforming all existing prompting baselines.
📝 Abstract
Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual-simulation-training