🤖 AI Summary
Existing methods struggle to precisely identify the internal mechanisms within large language models that support complex reasoning, and they fail to effectively model the sequential influence of internal components on final outputs. This work proposes the Integrated Policy Gradient (IPG) framework, which introduces, for the first time, the policy gradient concept from reinforcement learning into interpretability research on large language models. By backpropagating composite signals such as reasoning outcomes and incorporating retrospective analysis of reasoning trajectories, IPG identifies and modulates neurons and modules that cumulatively contribute to long-range reasoning. Experiments demonstrate that IPG achieves more accurate mechanistic localization across multiple reasoning models and effectively tunes both the capability and intensity of reasoning, validating its efficacy and generalizability.
📝 Abstract
Large language models (LLMs) demonstrate strong reasoning abilities on complex real-world problems. Yet the internal mechanisms driving these reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with specific textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or to capture the sequential influence of a model's internal workings on its reasoning outputs. In this paper, guided by outcome-oriented and sequential-influence-aware principles, we focus on identifying components that contribute sequentially to reasoning behavior, where outcomes accumulate through long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to a model's internal components by propagating compound outcome-based signals, such as post-reasoning accuracy, backward through the model's inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.
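The abstract does not spell out the attribution mechanics, but the core idea it gestures at, crediting a scalar outcome signal (e.g., post-reasoning accuracy) to internal activations by integrating gradients along a path from a baseline, can be illustrated with a toy sketch. Everything below (the `outcome` readout, the finite-difference gradient, the straight-line path) is an illustrative assumption, not the paper's actual IPG algorithm:

```python
# Toy sketch of integrated-gradients-style attribution of a scalar outcome
# signal to hidden activations. All function names and the toy "outcome"
# readout are hypothetical, for illustration only.

def outcome(h):
    # Toy outcome score: a nonlinear readout of hidden activations,
    # standing in for a reward such as post-reasoning accuracy.
    w = [0.5, -1.0, 2.0]
    s = sum(wi * hi for wi, hi in zip(w, h))
    return s * s  # nonlinear, so path integration matters

def grad(f, h, eps=1e-5):
    # Central finite-difference gradient of f at h (autograd stand-in).
    g = []
    for i in range(len(h)):
        hp, hm = list(h), list(h)
        hp[i] += eps
        hm[i] -= eps
        g.append((f(hp) - f(hm)) / (2 * eps))
    return g

def integrated_attribution(f, h, baseline=None, steps=64):
    # Attribute f(h) - f(baseline) to each coordinate of h by averaging
    # gradients along the straight path from the baseline to h.
    if baseline is None:
        baseline = [0.0] * len(h)
    avg_g = [0.0] * len(h)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (hi - b) for b, hi in zip(baseline, h)]
        avg_g = [a + gi for a, gi in zip(avg_g, grad(f, point))]
    return [(hi - b) * a / steps for hi, b, a in zip(h, baseline, avg_g)]

h = [1.0, 0.5, -0.25]
attr = integrated_attribution(outcome, h)
# Completeness property: attributions approximately sum to the change in outcome.
print(sum(attr), outcome(h) - outcome([0.0, 0.0, 0.0]))
```

Per-coordinate attributions of this form are what would let one rank neurons by their cumulative contribution to the outcome; the completeness check at the end verifies that the credit assigned across units accounts for the full outcome change.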