Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the auditability of chain-of-thought (CoT) reasoning in large reasoning models, which hinges on monitorability: whether the CoT faithfully and informatively reflects the model's internal computation. Focusing on reinforcement learning with verifiable rewards (RLVR), the paper systematically examines how monitorability evolves during training. Through controlled training and evaluation protocols, mechanistic analysis, and tracking of attention dynamics, the authors find that monitorability is not an automatic byproduct of RLVR but depends critically on the diversity of the training data and on the inclusion of instruction-following examples. Moreover, improvements in monitorability are orthogonal to gains in reasoning ability, and are driven primarily by reduced entropy in the response distribution and heightened attention to the prompt rather than by stronger causal reliance on the reasoning trace. The paper thereby characterizes the conditions and mechanisms under which monitorability emerges under RLVR, offering a pathway toward safer alignment.
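As a concrete reading of the "reduced entropy" mechanism (notation ours, not necessarily the paper's): writing $\pi_\theta(v \mid x, y_{<t})$ for the model's next-token distribution over vocabulary $\mathcal{V}$ given prompt $x$ and partial response $y_{<t}$, the per-step response entropy and its average over a length-$T$ response are

```latex
H_t = -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \,\log \pi_\theta(v \mid x, y_{<t}),
\qquad
\bar{H} = \frac{1}{T} \sum_{t=1}^{T} H_t .
```

"Response distribution sharpening" then corresponds to $\bar{H}$ decreasing over the course of RLVR training.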

📝 Abstract
As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chain-of-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability--the degree to which CoT faithfully and informatively reflects internal computation--can appear as a "free gift" during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability--improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute monitorability gains primarily to response distribution sharpening (entropy reduction) and increased attention to the prompt, rather than stronger causal reliance on reasoning traces. We also reveal how monitorability dynamics vary with controlled training and evaluation difficulty. Together, these findings provide a holistic view of how monitorability emerges under RLVR, clarifying when gains are likely to occur and when they are not.
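For readers who want to probe the two mechanisms the abstract points to, response-distribution entropy and attention mass on the prompt, here is a minimal sketch assuming a Hugging Face causal LM. The model name and prompt are illustrative placeholders; this is our reconstruction of the measured quantities, not the paper's released code.

```python
# Sketch: (1) mean per-token entropy of the response distribution, and
# (2) fraction of last-layer attention mass that generated tokens place
# on the prompt. Both are the quantities the abstract ties to
# monitorability gains under RLVR.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
# "eager" attention is needed so the model can return attention weights.
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

prompt = "Solve step by step: what is 17 * 24?"
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs.input_ids.shape[1]

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                         output_scores=True, return_dict_in_generate=True)

# (1) Response entropy: H(p_t) averaged over generated positions.
entropies = []
for scores in out.scores:  # one logit vector per generated token
    probs = torch.softmax(scores[0], dim=-1)
    entropies.append(-(probs * torch.log(probs + 1e-12)).sum().item())
mean_entropy = sum(entropies) / len(entropies)

# (2) Attention to prompt: re-run a forward pass over the full sequence
# and measure how much of each generated token's attention (last layer,
# averaged over heads and positions) falls on prompt tokens.
with torch.no_grad():
    attn = model(out.sequences, output_attentions=True).attentions[-1][0]  # (heads, seq, seq)
gen_rows = attn[:, prompt_len:, :]                         # queries = generated tokens
prompt_mass = gen_rows[:, :, :prompt_len].sum(-1).mean().item()

print(f"mean response entropy: {mean_entropy:.3f} nats")
print(f"attention mass on prompt: {prompt_mass:.3f}")
```

On the abstract's account, checkpoints whose RLVR training improved monitorability should show the mean response entropy falling and the prompt attention mass rising relative to the base model.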
Problem

Research questions and friction points this paper is trying to address.

monitorability
chain-of-thought
Large Reasoning Models
RLVR
model transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

monitorability
Reinforcement Learning with Verifiable Rewards
chain-of-thought
reasoning transparency
response entropy