🤖 AI Summary
This work investigates the intrinsic neural mechanisms underlying reflective reasoning in large language models (LLMs), addressing a gap in prior research: its overreliance on prompt engineering and neglect of interpretable neural representations. We propose a steering method based on intermediate-layer activations to construct directional vectors encoding three distinct reflective intents: "no reflection," "intrinsic reflection," and "triggered reflection." This enables precise, controllable enhancement or suppression of reflective behavior. Experiments reveal that reflection manifests hierarchically across model layers and is more readily suppressed than induced. Evaluated on GSM8k-adv with Qwen2.5-3B and Gemma3-4B, the approach achieves accurate intervention at targeted reflection layers and significantly improves robustness on complex reasoning tasks. The core contribution lies in uncovering an interpretable, neurally grounded basis for reflection and demonstrating its controllability, thereby establishing a mechanism-driven paradigm for reasoning optimization.
📝 Abstract
Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with three different reflective intentions: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv with Qwen2.5-3B and Gemma3-4B reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward mechanistic understanding of reflective reasoning in LLMs.
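The steering construction the abstract describes can be sketched as a difference-of-means over intermediate-layer activations: average the hidden states elicited by prompts of one reflection level, subtract the average for another level, and add or subtract the resulting direction at inference time. The sketch below uses synthetic NumPy arrays in place of real model activations; all function names, the scaling parameter `alpha`, and the data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def steering_vector(acts_a, acts_b):
    """Difference-of-means direction pointing from level A toward level B.

    acts_a, acts_b: (n_prompts, d_model) intermediate-layer activations
    collected under two reflection levels (synthetic stand-ins here).
    """
    return acts_b.mean(axis=0) - acts_a.mean(axis=0)

def steer(hidden, vec, alpha):
    """Intervene on one layer's hidden state.

    alpha > 0 pushes toward the target level (e.g., enhance reflection);
    alpha < 0 pushes away from it (e.g., suppress reflection).
    """
    return hidden + alpha * vec

# Synthetic "activations" for two reflection levels (illustrative only).
rng = np.random.default_rng(0)
d_model = 8
no_reflection = rng.normal(size=(16, d_model))
triggered_reflection = rng.normal(size=(16, d_model)) + 1.0

v = steering_vector(no_reflection, triggered_reflection)
h = rng.normal(size=d_model)            # one hidden state at the chosen layer
h_enhanced = steer(h, v, alpha=1.0)     # move toward "triggered reflection"
h_suppressed = steer(h, v, alpha=-1.0)  # move toward "no reflection"
```

In practice such a vector would be computed at a specific layer of Qwen2.5-3B or Gemma3-4B via forward hooks, and the asymmetry the paper reports (suppression easier than stimulation) would show up as different `alpha` magnitudes needed in each direction.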