Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how training incentives affect the monitorability of chain-of-thought (CoT) reasoning in large language models (LLMs). Addressing the critical safety question—"Does CoT faithfully reflect internal reasoning?"—we propose a method for quantifying monitorability, formally defined as a monitor's accuracy in predicting key latent states along the reasoning path. Through systematic empirical evaluation of training strategies—including length penalties, KL regularization, and adversarial optimization—we find: (i) standard regularizers exhibit no consistent effect on monitorability; (ii) adversarial optimization significantly degrades monitoring performance; and (iii) direct optimization of monitorability objectives fails to yield stable improvements. Our results uncover a fundamental tension between reasoning transparency and conventional training objectives, revealing inherent trade-offs in aligning model behavior with human-interpretability requirements. This work provides both theoretical grounding and empirical benchmarks for developing verifiable, auditable, and trustworthy AI systems.

📝 Abstract
AI systems that output their reasoning in natural language offer an opportunity for safety -- we can *monitor* their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different *training incentives*, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
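The monitorability measure described above can be sketched as follows. This is an illustrative toy, not code from the paper's repository: the function name `monitorability`, the keyword-based `toy_monitor`, and the example data are all assumptions; the only idea taken from the abstract is that monitorability is the rate at which a monitor recovers a key latent variable from the model's reasoning.

```python
# Hypothetical sketch: monitorability as the accuracy of a monitor that
# predicts a key latent variable from the model's chain-of-thought text.

def monitorability(cots, latents, monitor):
    """Fraction of examples where the monitor recovers the latent variable.

    cots    -- list of chain-of-thought strings
    latents -- list of ground-truth latent values, one per CoT
    monitor -- callable mapping a CoT string to a predicted latent value
    """
    correct = sum(monitor(cot) == z for cot, z in zip(cots, latents))
    return correct / len(cots)

# Toy monitor (purely illustrative): predicts the latent state True when
# the CoT mentions a keyword associated with it.
def toy_monitor(cot):
    return "shortcut" in cot.lower()

cots = ["I will take the shortcut here.", "Solving this step by step."]
latents = [True, False]
print(monitorability(cots, latents, toy_monitor))  # 1.0
```

In the paper's setting the monitor would itself be a model rather than a keyword rule, but the metric has the same shape: a prediction task scored against ground-truth latents.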
Problem

Research questions and friction points this paper is trying to address.

Investigates how training incentives affect chain-of-thought monitorability in AI systems
Measures monitorability by predicting latent variables from reasoning outputs
Finds adversarial optimization degrades monitor performance, unlike common incentives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel methodology measures monitorability via latent variable prediction
Adversarial optimization penalizing monitor accuracy degrades performance
Direct optimization for monitorability does not reliably improve results
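The adversarial incentive in the bullets above can be sketched as a reward penalty: the policy's task reward is reduced whenever the monitor reads its CoT correctly. The function name and the coefficient `adv_coef` are illustrative assumptions; the abstract only states that adversarial optimisation penalises monitor accuracy.

```python
# Hypothetical sketch of the adversarial training incentive: subtract a
# penalty from the task reward whenever the monitor correctly predicts
# the latent variable from the model's CoT.

def adversarial_reward(task_reward, monitor_correct, adv_coef=0.5):
    """Task reward minus a penalty for the monitor succeeding.

    task_reward     -- scalar reward for solving the task
    monitor_correct -- True if the monitor recovered the latent variable
    adv_coef        -- illustrative penalty weight (assumption)
    """
    return task_reward - adv_coef * float(monitor_correct)

print(adversarial_reward(1.0, True))   # 0.5 -- monitor succeeded, reward reduced
print(adversarial_reward(1.0, False))  # 1.0 -- monitor failed, full reward
```

Optimising against this reward pushes the model toward CoTs the monitor cannot read, which matches the paper's finding that this incentive degrades monitor performance.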