🤖 AI Summary
This work investigates how training incentives affect the monitorability of chain-of-thought (CoT) reasoning in large language models (LLMs). Addressing the safety-critical question of whether CoT faithfully reflects a model's internal reasoning, the authors propose quantifying monitorability via latent-variable prediction, defining it as a monitor's accuracy in predicting key latent states from the reasoning trace. A systematic empirical evaluation of prevalent training strategies (length penalties, KL regularization, and adversarial optimization) finds that: (i) standard regularizers exhibit no consistent effect on monitorability; (ii) adversarial optimization significantly degrades monitor performance; and (iii) directly optimizing for monitorability fails to yield stable improvements. These results uncover a tension between reasoning transparency and conventional training objectives, revealing trade-offs in aligning model behavior with human-interpretability requirements, and provide both theoretical grounding and empirical benchmarks for developing verifiable, auditable, and trustworthy AI systems.
📝 Abstract
AI systems that output their reasoning in natural language offer an opportunity for safety: we can *monitor* their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different *training incentives*, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable from the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
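The measurement described above can be sketched in a few lines: monitorability is the fraction of episodes in which a monitor, reading only the model's CoT, correctly predicts the key latent variable. The function names, the keyword-based toy monitor, and the episode records below are all illustrative assumptions, not the paper's actual implementation (which uses trained monitors over real reasoning traces).

```python
def monitorability(monitor, episodes):
    """Fraction of episodes where the monitor recovers the latent
    variable from the chain-of-thought alone (hypothetical metric sketch)."""
    correct = sum(monitor(ep["cot"]) == ep["latent"] for ep in episodes)
    return correct / len(episodes)

# Toy stand-in monitor: guesses the latent state (here, parity) from a
# keyword in the reasoning trace. Purely illustrative.
def toy_monitor(cot):
    return "even" if "even" in cot else "odd"

episodes = [
    {"cot": "the sum is even, so the answer is A", "latent": "even"},
    {"cot": "the parity works out odd here",       "latent": "odd"},
    {"cot": "unclear shortcut, just answer B",     "latent": "even"},
]
print(monitorability(toy_monitor, episodes))  # → 0.6666666666666666
```

Under this framing, an adversarial incentive would add a term penalising this accuracy during training, while "direct optimisation for monitorability" would reward it.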