AI Summary
This paper challenges the prevailing practice of equating faithfulness solely with the explicit presence of prompting cues in chain-of-thought (CoT) reasoning, arguing that this practice conflates genuine unfaithfulness with incomplete cue expression caused by information compression. Method: We propose faithful@k, a novel metric quantifying how token budget constraints affect cue explicitness, and use causal mediation analysis to test whether implicit cues exert causal effects on predictions via the CoT. We further combine the Biasing Features metric, corruption-based evaluation, and multi-hop reasoning benchmarks. Results: Evaluations on Llama-3 and Gemma-3 show that, in some models, over 50% of the CoT traces deemed unfaithful by the conventional explicitness criterion are judged faithful by other metrics; moreover, larger token budgets raise cue verbalization to as much as 90% in some settings. This work shifts faithfulness evaluation from a unidimensional focus on explicitness toward a multidimensional framework grounded in causal interpretability.
Abstract
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics; in some models this share exceeds 50%. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting that much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
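
To make the Biasing Features rule described above concrete, the following is a minimal sketch of the labeling logic, not the authors' implementation. It assumes that "affected the prediction" means the answer changed to the hinted option once the hint was injected, and that a separate (hypothetical) detector has already decided whether the CoT verbalizes the hint. The names `Trial`, `hint_target`, `cot_mentions_hint`, and `biasing_features_label` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question evaluated with and without an injected hint (hypothetical schema)."""
    answer_without_hint: str   # model prediction on the clean prompt
    answer_with_hint: str      # model prediction on the hinted prompt
    hint_target: str           # the answer the injected hint points to
    cot_mentions_hint: bool    # assumed output of some hint-verbalization detector


def biasing_features_label(trial: Trial) -> str:
    """Label a CoT under the hint-verbalization criterion.

    Assumption: the hint "affected" the prediction if the answer changed
    and the new answer is the one the hint points to.
    """
    hint_affected = (
        trial.answer_without_hint != trial.answer_with_hint
        and trial.answer_with_hint == trial.hint_target
    )
    if not hint_affected:
        # The criterion only applies when the hint changed the outcome.
        return "not_applicable"
    return "faithful" if trial.cot_mentions_hint else "unfaithful"


def unfaithful_rate(trials: list[Trial]) -> float:
    """Share of hint-affected trials whose CoT omits the hint."""
    labels = [biasing_features_label(t) for t in trials]
    affected = [label for label in labels if label != "not_applicable"]
    if not affected:
        return 0.0
    return affected.count("unfaithful") / len(affected)
```

Under this criterion a CoT is penalized purely for omitting the hint, which is the behavior the abstract argues can reflect incompleteness rather than unfaithfulness.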