From Emergence to Control: Probing and Modulating Self-Reflection in Language Models

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the origin, mechanisms, and controllability of self-reflection in large language models (LLMs). We find that the capability for self-reflection is implicitly present even in pretrained models that have undergone no RLVR fine-tuning, but is activated extremely rarely (e.g., in only 0.6% of Qwen2.5's reasoning traces). To address this, we propose a training-free bidirectional control paradigm: (i) reflection-inducing probes that activate latent reflective behavior, and (ii) interpretable direction vectors that precisely modulate reflection intensity. The mechanism is validated through internal-representation analysis and comparison with RLVR-trained models. Experiments show that our method raises Qwen2.5's reflection frequency from 0.6% to 18.6%, improves reasoning performance by up to 12%, and enables on-demand suppression of reflection to reduce computational overhead. This constitutes the first approach enabling a flexible trade-off between reasoning quality and inference efficiency.
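The probing idea described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `REFLECTION_TRIGGER`, `build_probe`, and the marker list in `reflection_rate` are all illustrative names and values chosen here; the paper harvests actual reflection-triggering traces from RLVR fine-tuned models.

```python
# Hypothetical sketch of reflection-inducing probing: splice a
# reflection-triggering snippet (in the paper, harvested from a
# fine-tuned model's reasoning traces) into a pretrained model's
# context so it continues in a self-reflective mode.
REFLECTION_TRIGGER = "Wait, let me double-check the previous step."  # illustrative

def build_probe(question: str, partial_reasoning: str,
                trigger: str = REFLECTION_TRIGGER) -> str:
    """Assemble a prompt that ends with a reflection trigger,
    inviting the model to continue by re-examining its reasoning."""
    return f"{question}\n{partial_reasoning}\n{trigger}\n"

def reflection_rate(outputs: list[str],
                    markers=("wait", "re-check", "let me verify")) -> float:
    """Fraction of sampled continuations containing a reflection
    marker -- a crude proxy for the activation rates reported above."""
    hits = sum(any(m in out.lower() for m in markers) for out in outputs)
    return hits / len(outputs)
```

A usage example: scoring two sampled continuations, one reflective and one not, with `reflection_rate` yields 0.5, mirroring how the 0.6% and 18.6% frequencies above could be measured over many samples.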

📝 Abstract
Self-reflection -- the ability of a large language model (LLM) to revisit, evaluate, and revise its own reasoning -- has recently emerged as a powerful behavior enabled by reinforcement learning with verifiable rewards (RLVR). While self-reflection correlates with improved reasoning accuracy, its origin and underlying mechanisms remain poorly understood. In this work, *we first show that self-reflection is not exclusive to RLVR fine-tuned models: it already emerges, albeit rarely, in pretrained models*. To probe this latent ability, we introduce Reflection-Inducing Probing, a method that injects reflection-triggering reasoning traces from fine-tuned models into pretrained models. This intervention raises the self-reflection frequency of Qwen2.5 from 0.6% to 18.6%, revealing a hidden capacity for reflection. Moreover, our analysis of internal representations shows that both pretrained and fine-tuned models maintain hidden states that distinctly separate self-reflective from non-reflective contexts. Leveraging this observation, *we then construct a self-reflection vector, a direction in activation space associated with self-reflective reasoning*. By manipulating this vector, we enable bidirectional control over self-reflective behavior in both pretrained and fine-tuned models. Experiments across multiple reasoning benchmarks show that amplifying this vector improves reasoning performance by up to 12%, while suppressing it reduces computational cost, providing a flexible mechanism to navigate the trade-off between reasoning quality and efficiency without requiring additional training. Our findings deepen our understanding of self-reflection and support a growing body of work showing that understanding model internals can enable precise behavioral control.
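The self-reflection vector described in the abstract can be sketched with the common difference-of-means steering recipe: take the mean hidden state over self-reflective contexts, subtract the mean over non-reflective contexts, and add a scaled copy of that direction back into the residual stream at inference time. The paper's exact construction, layer choice, and scaling may differ; the synthetic 4-dimensional hidden states below are purely illustrative.

```python
import numpy as np

def reflection_vector(reflective: np.ndarray,
                      non_reflective: np.ndarray) -> np.ndarray:
    """Difference-of-means direction separating self-reflective from
    non-reflective hidden states (rows are per-context activations)."""
    v = reflective.mean(axis=0) - non_reflective.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-normalize the direction

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Bidirectional control: alpha > 0 pushes the hidden state toward
    reflection, alpha < 0 suppresses it (saving computation)."""
    return hidden + alpha * v

# Toy demo on synthetic activations (hypothetical data, not the paper's).
rng = np.random.default_rng(0)
reflective_states = rng.normal(1.0, 0.1, size=(8, 4))
non_reflective_states = rng.normal(0.0, 0.1, size=(8, 4))
v = reflection_vector(reflective_states, non_reflective_states)

h = np.zeros(4)                   # stand-in for one residual-stream state
h_up = steer(h, v, alpha=2.0)     # amplify reflective behavior
h_down = steer(h, v, alpha=-2.0)  # suppress reflective behavior
```

In a real model the `steer` step would be applied inside a forward hook on a chosen transformer layer; the single scalar `alpha` is what makes the reflection-intensity control continuous and training-free.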
Problem

Research questions and friction points this paper is trying to address.

Understanding self-reflection emergence in pretrained language models
Probing latent self-reflection ability in pretrained models
Controlling self-reflection behavior via activation space manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reflection-Inducing Probing activates latent self-reflection in pretrained models
A self-reflection vector in activation space enables bidirectional control
Amplifying the vector improves reasoning; suppressing it cuts inference cost