🤖 AI Summary
This work investigates how training incentives affect the monitorability of chain-of-thought (CoT) reasoning in large language models (LLMs). Addressing the safety-critical question of whether CoT faithfully reflects a model's internal reasoning, the authors propose quantifying monitorability via latent-variable prediction, defining it as a monitor's accuracy in predicting key latent states from the reasoning trace. A systematic empirical evaluation of prevalent training strategies (length penalties, KL regularization, and adversarial optimization) finds that: (i) standard regularizers exhibit no consistent effect on monitorability; (ii) adversarial optimization significantly degrades monitor performance; and (iii) directly optimizing for monitorability fails to yield stable improvements. These results uncover a tension between reasoning transparency and conventional training objectives, revealing trade-offs in aligning model behavior with human-interpretability requirements, and provide both theoretical grounding and empirical benchmarks for developing verifiable, auditable, and trustworthy AI systems.
📝 Abstract
AI systems that output their reasoning in natural language offer an opportunity for safety: we can *monitor* their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different *training incentives*, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable from the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
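The measurement described above can be sketched in a few lines: monitorability is the fraction of episodes in which a monitor, reading only the model's CoT, correctly predicts the key latent variable. The function names, the keyword-based toy monitor, and the episode records below are all illustrative assumptions, not the paper's actual implementation (which uses trained monitors over real reasoning traces).

```python
def monitorability(monitor, episodes):
    """Fraction of episodes where the monitor recovers the latent
    variable from the chain-of-thought alone (hypothetical metric sketch)."""
    correct = sum(monitor(ep["cot"]) == ep["latent"] for ep in episodes)
    return correct / len(episodes)

# Toy stand-in monitor: guesses the latent state (here, parity) from a
# keyword in the reasoning trace. Purely illustrative.
def toy_monitor(cot):
    return "even" if "even" in cot else "odd"

episodes = [
    {"cot": "the sum is even, so the answer is A", "latent": "even"},
    {"cot": "the parity works out odd here",       "latent": "odd"},
    {"cot": "unclear shortcut, just answer B",     "latent": "even"},
]
print(monitorability(toy_monitor, episodes))  # → 0.6666666666666666
```

Under this framing, an adversarial incentive would add a term penalising this accuracy during training, while "direct optimisation for monitorability" would reward it.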