MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the frequent lack of causal alignment between the chain-of-thought (CoT) reasoning generated by large language models and their final outputs, a mismatch that undermines the interpretability and monitorability of model decisions. To this end, the paper introduces MonitorBench, the first comprehensive, fully open-source benchmark for evaluating CoT monitorability, comprising 1,514 annotated instances across 19 tasks in 7 scenario categories, along with two stress-test settings to quantify monitorability degradation. Experimental results reveal a negative correlation between model capability and monitorability, demonstrate that structural reasoning through the decision-critical factor substantially enhances monitorability, and show that closed-source models generally underperform in this regard. Under stress testing, monitorability degrades by up to 30% on tasks that do not require structural reasoning over the decision-critical factors.
📝 Abstract
Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the problem of reduced CoT monitorability. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there is a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress tests, with monitorability dropping by up to 30% on some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-testing techniques for monitorability, and developing new monitoring approaches.
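The page does not show MonitorBench's exact scoring protocol, but the reported numbers suggest a simple framing: monitorability as the fraction of instances on which a monitor detects the decision-critical factor in the CoT, and stress-test degradation as the drop in that fraction. The sketch below is a minimal illustration under that assumption; `Instance`, `monitorability`, and `degradation` are hypothetical names for this page, not MonitorBench's actual API.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    # Whether a CoT monitor detected the decision-critical factor
    # in the model's chain of thought for this test instance.
    detected_baseline: bool   # standard prompting
    detected_stressed: bool   # under a stress-test setting

def monitorability(flags: list[bool]) -> float:
    """Fraction of instances whose CoT exposed the decision-critical factor."""
    return sum(flags) / len(flags) if flags else 0.0

def degradation(instances: list[Instance]) -> float:
    """Absolute drop in monitorability when a stress-test setting is applied."""
    base = monitorability([i.detected_baseline for i in instances])
    stressed = monitorability([i.detected_stressed for i in instances])
    return base - stressed

# Toy data: 4 of 5 CoTs are monitorable at baseline, only 2 of 5 under
# stress, i.e. a 40-point drop (the paper reports drops of up to 30%
# on some tasks).
toy = [Instance(True, True), Instance(True, False), Instance(True, False),
       Instance(True, True), Instance(False, False)]
print(f"degradation: {degradation(toy):.0%}")  # -> degradation: 40%
```

Under this framing, the paper's two stress-test settings would each produce their own `detected_stressed` flags, so per-task degradation can be compared across settings and models.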
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
Monitorability
Large Language Models
Decision-Critical Factors
Benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Monitorability
Large Language Models
Benchmarking
Stress Testing
Decision-Critical Factors
Han Wang
University of Illinois Urbana-Champaign
Yifan Sun
University of Illinois Urbana-Champaign
Brian Ko
University of Washington
Mann Talati
University of Illinois Urbana-Champaign
Jiawen Gong
University of Illinois Urbana-Champaign
Zimeng Li
University of Illinois Urbana-Champaign
Naicheng Yu
University of California San Diego
Xucheng Yu
University of Illinois Urbana-Champaign
Wei Shen
University of Illinois Urbana-Champaign
Vedant Jolly
University of Illinois Urbana-Champaign
Huan Zhang
Assistant Professor at University of Illinois Urbana-Champaign
Machine Learning · Formal Verification · Trustworthy AI · Machine Learning Security · Optimization