A Pragmatic Way to Measure Chain-of-Thought Monitorability

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

Assessing the monitorability of chain-of-thought (CoT) reasoning in large language models (LLMs) remains challenging under architectural or training modifications. Method: We propose the first quantitative framework that decouples monitorability into two orthogonal dimensions: readability (human interpretability) and coverage (reasoning completeness). We design a general-purpose, zero-shot auto-rater prompt enabling strong LLMs to evaluate CoT quality automatically. Validated via synthetic degradation experiments, our framework is empirically evaluated across state-of-the-art models and complex benchmarks (e.g., GSM8K, MMLU). Results: Current mainstream CoT models exhibit high monitorability. Our framework precisely quantifies how model design choices affect monitorability, enabling fine-grained, reproducible, and scalable assessment. This provides a foundational tool for enhancing AI transparency, interpretability, and safety-controllability.

Technology Category

Application Category

📝 Abstract

While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks, finding that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact monitorability. While the exact prompt we share is still a preliminary version under ongoing development, we are sharing it now in the hopes that others in the community will find it useful. Our method helps measure the default monitorability of CoT - it should be seen as a complement, not a replacement, for the adversarial stress-testing needed to test robustness against deliberately evasive models.

Problem

Research questions and friction points this paper is trying to address.

Measuring Chain-of-Thought legibility and coverage for AI safety monitoring

Developing automated metrics to evaluate reasoning transparency in language models

Tracking how model design decisions impact Chain-of-Thought monitorability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Measures CoT legibility and coverage metrics

Implements autorater prompt for LLM evaluation

Tracks design impact on monitorability

🔎 Similar Papers

No similar papers found.