Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing methods struggle to precisely identify the internal mechanisms within large language models that support complex reasoning, and fail to effectively model the sequential influence from internal components to final outputs. This work proposes the Integrated Policy Gradient (IPG) framework, which introduces, for the first time, the policy gradient concept from reinforcement learning into interpretability research on large language models. By backpropagating composite signals such as reasoning outcomes and incorporating retrospective analysis of reasoning trajectories, IPG identifies and modulates neurons or modules that cumulatively contribute to long-range reasoning. Experiments demonstrate that IPG achieves more accurate mechanistic localization across multiple reasoning models and effectively tunes both the capability and intensity of reasoning, thereby validating its efficacy and generalizability.

📝 Abstract
Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with particular textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or to capture the sequential influence from model internals to reasoning outputs. In this paper, building on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that contribute sequentially to reasoning behavior, where outcomes accumulate through long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to the model's inner components by propagating compound outcome-based signals, such as post-reasoning accuracy, backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.
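To make the abstract's core idea concrete, here is a minimal toy sketch of policy-gradient-style credit assignment for internal components: an outcome reward earned at the end of a trajectory is distributed to internal "neurons" by accumulating reward-weighted sensitivities of the chosen actions' log-probabilities at each step. The toy model, the two-neuron policy, and all function names are illustrative assumptions, not the paper's actual IPG implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def policy(acts):
    # Toy next-action policy: logits over 2 actions are a fixed linear
    # function of 2 internal neuron activations (w[action][neuron]).
    w = [[1.0, -0.5], [-1.0, 0.5]]
    logits = [sum(wa[j] * acts[j] for j in range(2)) for wa in w]
    return softmax(logits)

def attribute(trajectory, reward, eps=1e-4):
    """Estimate sum_t reward * d log pi(a_t | acts_t) / d neuron_j
    by finite differences -- a policy-gradient-style credit signal
    that accumulates each neuron's influence over the whole trajectory."""
    scores = [0.0, 0.0]
    for acts, action in trajectory:
        base = math.log(policy(acts)[action])
        for j in range(2):
            bumped = list(acts)
            bumped[j] += eps
            grad = (math.log(policy(bumped)[action]) - base) / eps
            scores[j] += reward * grad
    return scores

# A short trajectory of (neuron activations, chosen action) pairs, with a
# terminal outcome reward of +1 (e.g., the final answer was correct).
traj = [([0.9, 0.1], 0), ([0.8, 0.2], 0), ([0.7, 0.3], 0)]
scores = attribute(traj, reward=1.0)
# Neuron 0 consistently pushes probability toward the rewarded action,
# so it receives positive credit; neuron 1 receives negative credit.
```

In a real LLM the policy would be the model's token distribution, the activations would be hidden states, and the gradients would come from autograd rather than finite differences; the sketch only shows how an outcome signal is pushed backward along a trajectory to score internal components.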
Problem

Research questions and friction points this paper is trying to address.

interpretability
reasoning mechanisms
sequential influence
large language models
model internals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated Policy Gradient
LLM reasoning
interpretability
sequential influence
reasoning control
Changming Li
ShanghaiTech University
Kaixing Zhang
ShanghaiTech University
Haoyun Xu
ShanghaiTech University
Yingdong Shi
ShanghaiTech University
Zheng Zhang
ShanghaiTech University
Kaitao Song
Senior Researcher, Microsoft Research
Natural Language Processing · Large Language Models · Artificial General Intelligence
Kan Ren
Assistant Professor, ShanghaiTech University
Machine Learning · Data Mining · Large Language Model · Foundation Model