Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

career value

167K/year

🤖 AI Summary

This work addresses the unreliability of Chain-of-Thought (CoT) reasoning, which often stems from models exploiting shortcuts in prompts rather than engaging in genuine logical inference. To tackle this, the authors propose a task-agnostic framework for evaluating CoT faithfulness grounded in an information-flow perspective, diagnosing reasoning trajectories through three structural properties: sufficiency, completeness, and necessity. Leveraging entropy, masked KL divergence, and gradient-based analyses, the method identifies shortcut behaviors and introduces targeted training interventions—including attention masking, reversed gradient masking, CoT-aware gradient optimization, and adversarial perturbations of prompt representations. Experiments demonstrate that the approach significantly strengthens the mediating role of CoT across multiple tasks, enhances reasoning faithfulness, renders shortcut exploitation and reward hacking more transparent, and reduces model sensitivity to misleading prompts.

📝 Abstract

Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety-research/faithful-cot.

Problem

Research questions and friction points this paper is trying to address.

faithfulness

chain-of-thought

information flow

reasoning trace

shortcuts

Innovation

Methods, ideas, or system contributions that make the work stand out.

faithful reasoning

information flow

chain-of-thought