From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although R1-style large language models exhibit self-reflective capabilities, the underlying mechanisms remain unclear. This study employs the logit-lens technique to trace the activation trajectory of reflection behavior layer by layer, revealing for the first time a staged developmental pathway: prompt semantics are first encoded along latent-control directions, are then integrated with discourse-level cues in semantic-pivot layers, and finally surface as explicit reflection behavior. By combining token-level semantic readouts, linear-direction analysis, targeted interventions, and probability-mass tracking, the work establishes a causal chain, "semantic prompt → latent-control direction → discourse cue → behavioral output", demonstrating that prompt semantics causally modulate reflection behavior and elucidating how a human-like meta-cognitive process is implemented inside the model.
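The summary's core measurement, the logit lens, reads a token distribution out of an intermediate hidden state by projecting it through the model's unembedding matrix. The sketch below illustrates that computation on synthetic matrices; the dimensions, random weights, and `logit_lens` helper are illustrative stand-ins, not the paper's code, and a real analysis would use a model's own final layer norm and unembedding.

```python
import numpy as np

# Toy logit-lens sketch: hidden states from 3 "layers" of a 4-dim model,
# projected through a shared unembedding matrix into a 5-token vocabulary.
# All matrices here are synthetic; a real analysis would take the residual
# stream and unembedding from an actual checkpoint.
rng = np.random.default_rng(0)
d_model, vocab, n_layers = 4, 5, 3

W_U = rng.normal(size=(d_model, vocab))        # unembedding matrix
hidden = rng.normal(size=(n_layers, d_model))  # residual stream, one row per layer

def logit_lens(h, W_U):
    """Read out a token distribution from an intermediate hidden state."""
    logits = h @ W_U
    probs = np.exp(logits - logits.max())      # stable softmax
    return probs / probs.sum()

# Track how the most likely token (and its probability mass) evolves by layer.
for layer, h in enumerate(hidden):
    p = logit_lens(h, W_U)
    print(f"layer {layer}: top token = {p.argmax()}, p = {p.max():.3f}")
```

Tracking `p.argmax()` and `p.max()` across layers is what lets the study identify where budget semantics, discourse cues, and reflection tokens each begin to dominate the probability mass.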

📝 Abstract
R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process, progressing from latent monitoring to discourse-level regulation, and finally to overt self-reflection. Our analysis code can be found at https://github.com/DYR1/S3-CoT.
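The abstract's intervention experiments shift activations along a latent-control direction and observe how probability mass moves between turning-point and summarization cues. The toy sketch below illustrates that style of directional intervention; the direction vector, hidden state, unembedding, and the choice of cue token indices are all synthetic assumptions for illustration, not the paper's actual vectors.

```python
import numpy as np

# Directional-intervention sketch: add alpha * direction to a hidden state
# and watch probability mass shift between two cue tokens
# (index 0 = turning-point cue, index 1 = summarization cue, by assumption).
rng = np.random.default_rng(1)
d_model, vocab = 8, 6

W_U = rng.normal(size=(d_model, vocab))   # synthetic unembedding
h = rng.normal(size=d_model)              # synthetic hidden state
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)    # unit "latent-control" direction

def cue_probs(h, W_U):
    """Softmax over the vocabulary, returning the two cue-token probabilities."""
    logits = h @ W_U
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[0], p[1]                     # (turning-point, summarization)

# Sweep the intervention strength; larger |alpha| pushes the state further
# along the direction, reweighting the competition between the two cues.
for alpha in (-2.0, 0.0, 2.0):
    turn, summ = cue_probs(h + alpha * direction, W_U)
    print(f"alpha={alpha:+.1f}: turning={turn:.3f}, summarization={summ:.3f}")
```

In the paper's causal chain, such a shift at the latent-control stage is what propagates forward to regulate the sampling likelihood of reflection-behavior tokens in later layers.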
Problem

Research questions and friction points this paper is trying to address.

self-reflection
meta-cognitive activation
latent signals
large language models
reflection behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

meta-cognition
reflection behavior
activation trajectory
logit lens
latent control