Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

📅 2026-05-18

🤖 AI Summary
This work addresses the mismatch between chain-of-thought (CoT) reasoning and final outputs in large reasoning models (LRMs), which undermines CoT's reliability for safety monitoring. The authors propose a probe trajectory framework: a probe is evaluated at every generated token to track how a concept's probability evolves over the reasoning process, and the resulting trajectory is used to predict future model behavior. Signal-processing features that capture volatility, trend, and steady-state behavior characterize these reasoning dynamics. The study also finds that template-based training data reaches near-parity with costly dynamically generated model responses, and that the pooling operation is critical: average-pooling and last-token methods fall to near-random performance, while max-pooling reaches up to 95% AUROC and yields stable trajectories. Experiments on four datasets and four reasoning models, spanning safety and mathematics, show that trajectory features improve the separability of future model states, making probe trajectories a complementary tool for monitoring LRM behavior.
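To make the idea of a probe trajectory concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear probe with hypothetical weights `w` and bias `b` applied to per-token hidden states, and it uses illustrative feature definitions (standard deviation of first differences for volatility, least-squares slope for trend, mean of the last few tokens for steady state); the paper's actual features may differ.

```python
import numpy as np

def probe_trajectory(hidden_states, w, b):
    """Score every token with a linear probe: p_t = sigmoid(w . h_t + b).

    hidden_states: array of shape (T, d), one row per generated token.
    w, b: hypothetical probe weights (d,) and bias (assumed, not from the paper).
    """
    logits = hidden_states @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

def trajectory_features(p, tail=5):
    """Summarize a trajectory with three illustrative signal features."""
    t = np.arange(len(p))
    slope = np.polyfit(t, p, 1)[0]  # first coefficient of a degree-1 fit is the slope
    return {
        "volatility": float(np.std(np.diff(p))),    # spread of token-to-token changes
        "trend": float(slope),                      # overall direction of the trajectory
        "steady_state": float(np.mean(p[-tail:])),  # level the trajectory settles at
    }

# Toy example: hidden states that drift steadily along the probe direction.
hidden = np.linspace(-2.0, 2.0, 20)[:, None] * np.ones((1, 4))
traj = probe_trajectory(hidden, w=np.full(4, 0.5), b=0.0)
print(trajectory_features(traj))
```

In a monitoring setup, feature dictionaries like this could be fed to a simple downstream classifier; the `tail` window length here is an arbitrary choice for illustration.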
📝 Abstract
Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
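The three pooling operations compared in the abstract can be sketched as follows. This is an illustrative reading that assumes pooling is applied over the token axis of the hidden states before a probe sees them; the abstract does not specify exactly where pooling happens, and the function name `pool_tokens` is hypothetical.

```python
import numpy as np

def pool_tokens(hidden_states, mode="max"):
    """Collapse per-token hidden states of shape (T, d) into one vector of shape (d,).

    Assumption: pooling runs over the token axis (axis 0).
    """
    if mode == "max":
        return hidden_states.max(axis=0)   # strongest activation per dimension
    if mode == "mean":
        return hidden_states.mean(axis=0)  # average activation per dimension
    if mode == "last":
        return hidden_states[-1]           # representation of the final token only
    raise ValueError(f"unknown pooling mode: {mode}")

# Toy example with three tokens and two hidden dimensions.
H = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [2.0,  4.0]])
for mode in ("max", "mean", "last"):
    print(mode, pool_tokens(H, mode))
```

Max-pooling keeps a strong activation from any single token, whereas averaging dilutes it across the sequence and last-token pooling ignores everything but the final position; this is one intuition for why the choices could behave so differently, though the sketch does not reproduce the reported results.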
Problem

Research questions and friction points this paper is trying to address.

Chain of Thought
Large Reasoning Models
safety monitoring
reasoning dynamics
probe trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

probe trajectories
Chain of Thought
temporal dynamics
max-pooling
reasoning monitoring