Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

📅 2025-10-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the problem of quantifying the monitorability of chain-of-thought (CoT) reasoning in large language models—i.e., how faithfully and completely CoT outputs reflect the model's internal reasoning process. The authors propose a *monitorability score* that jointly formalizes faithfulness and verbosity (whether the CoT lists every factor needed to solve the task), grounded in a view of the CoT as the model's external working memory. Using the Inspect library, they conduct empirical evaluations across the BBH, GPQA, and MMLU benchmarks with both instruction-tuned and reasoning-specialized models. Results reveal that models frequently appear faithful while omitting critical reasoning steps—undermining effective monitoring—and that monitorability varies sharply across model families. The evaluation framework is open-sourced to support reproducible research.

📝 Abstract
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external "working memory", a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
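The abstract combines per-task faithfulness and verbosity into one monitorability score. The paper's exact aggregation formula is not given here, so the sketch below is a hypothetical illustration: it assumes both quantities are normalized to [0, 1] and uses a harmonic mean, so a CoT must score well on *both* axes to count as monitorable (a model that looks faithful but omits key factors still scores low).

```python
def monitorability(faithfulness: float, verbosity: float) -> float:
    """Hypothetical combined score: harmonic mean of faithfulness and
    verbosity, each assumed to lie in [0, 1]. Not the paper's formula."""
    for score in (faithfulness, verbosity):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    if faithfulness + verbosity == 0:
        return 0.0
    return 2 * faithfulness * verbosity / (faithfulness + verbosity)


# A CoT that appears faithful (0.9) but lists few required factors (0.2)
# ends up hard to monitor under this aggregation:
print(monitorability(0.9, 0.2))  # ~0.33, low despite high faithfulness
print(monitorability(1.0, 1.0))  # 1.0, fully monitorable
```

The harmonic mean is one design choice among several (a product or a minimum would also penalize imbalance); the point is only that neither faithfulness nor verbosity alone suffices for the monitoring property the paper targets.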
Problem

Research questions and friction points this paper is trying to address.

Evaluating faithfulness of chain-of-thought reasoning in language models
Measuring verbosity to assess completeness of reasoning steps
Developing holistic monitorability score for model safety evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining faithfulness and verbosity for monitorability
Introducing verbosity to assess reasoning completeness
Creating a unified score for chain-of-thought evaluation