Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

📅 2025-12-28

📈 Citations: 0

✨ Influential: 0

career value

173K/year

🤖 AI Summary

This paper challenges the prevailing practice of equating faithfulness solely with the explicit presence of prompting cues in chain-of-thought (CoT) reasoning, arguing that it conflates genuine unfaithfulness with incomplete cue expression due to information compression. Method: We propose faithful@k—a novel metric quantifying how token budget constraints affect cue explicitness—and introduce causal mediation analysis to rigorously test whether implicit cues exert causal effects on predictions via CoT. We further integrate biasing features, corruption-based evaluation, and multi-hop reasoning benchmarks. Results: Evaluations on Llama-3 and Gemma-3 demonstrate that over 50% of CoT traces deemed unfaithful by conventional explicitness criteria are, in fact, faithful under our multidimensional assessment; moreover, increasing token budget raises cue explicitness to 90%. This work shifts faithfulness evaluation from a unidimensional focus on explicitness toward a comprehensive, multidimensional framework grounded in causal interpretability.

Technology Category

Application Category

📝 Abstract

Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.

Problem

Research questions and friction points this paper is trying to address.

Evaluates Chain-of-Thought faithfulness beyond hint verbalization

Proposes metrics to distinguish unfaithfulness from narrative incompleteness

Advocates causal mediation analysis for broader interpretability assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces faithful@k metric for chain-of-thought evaluation

Uses causal mediation analysis to trace non-verbalized hints

Advocates broader interpretability toolkit beyond hint-based metrics

🔎 Similar Papers

Markovian Transformers for Informative Language Modeling