🤖 AI Summary
This work investigates the internal mechanisms by which reinforcement learning (RL) fine-tuning enhances large language model (LLM) capabilities. Because there has been no systematic account of how RL alters the activation strength and diversity of internal computational circuits, we propose an interpretable, circuit-level analysis framework based on Edge Attribution Patching (EAP) that quantifies changes in activation magnitude, pattern entropy, and edge distribution. Experiments across multiple LLMs show that online RL methods, particularly PPO and GRPO, consistently increase both activation strength and diversity, supporting a novel interpretation: RL induces more redundant and flexible information flow within the model. In contrast, preference-based optimization methods (e.g., DPO) yield weaker and less stable effects. To our knowledge, this is the first study to systematically characterize how distinct RL paradigms differentially modulate internal information dynamics across diverse LLM architectures. All code and analysis tools are publicly released to ensure reproducibility.
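For readers unfamiliar with EAP, the display below sketches the standard first-order approximation from the attribution-patching literature that circuit-level frameworks of this kind typically build on. The notation is ours for illustration; the paper may use a different variant.

```latex
% First-order EAP approximation (illustrative notation, not necessarily the paper's):
% the effect on a metric L of corrupting the edge e = (u -> v) is estimated from a
% single clean/corrupted activation pair and a gradient taken on the clean run.
\[
  \Delta L_{u \to v} \;\approx\;
  \bigl(a_u^{\text{corr}} - a_u^{\text{clean}}\bigr)^{\top}
  \left.\frac{\partial L}{\partial a_v}\right|_{x_{\text{clean}}}
\]
% Here a_u is the activation written by upstream node u and a_v is the input read by
% downstream node v; the resulting per-edge scores are the quantities over which
% strength, entropy, and edge-distribution statistics can be computed.
```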
📝 Abstract
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence shows that RL fine-tuning improves the capabilities of LLMs beyond what SFT alone achieves. However, the underlying mechanisms by which RL fine-tuning enhances the capabilities of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families reveals two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and that their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or less consistent internal changes than PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.
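As a rough illustration of the kind of statistics the abstract describes, the sketch below computes an activation-strength proxy, the Shannon entropy of the normalized edge-score distribution, and a simple concentration measure from a vector of per-edge attribution scores. The function name, the top-10 concentration choice, and the synthetic inputs are our own assumptions for illustration and are not taken from the released analysis code.

```python
import numpy as np

def edge_pattern_stats(edge_scores: np.ndarray, top_k: int = 10) -> dict:
    """Summarize a vector of per-edge EAP attribution scores (illustrative only).

    Returns a mean-absolute-score proxy for activation strength, the Shannon
    entropy of the normalized score distribution (higher = more diverse
    activation pattern), and the probability mass held by the top-k edges
    (higher = more concentrated edge distribution).
    """
    abs_scores = np.abs(edge_scores)
    strength = float(abs_scores.mean())
    p = abs_scores / (abs_scores.sum() + 1e-12)          # normalize to a distribution
    entropy = float(-(p * np.log(p + 1e-12)).sum())      # pattern entropy in nats
    top_mass = float(np.sort(p)[::-1][:top_k].sum())     # mass carried by top-k edges
    return {"strength": strength, "entropy": entropy, f"top{top_k}_mass": top_mass}

# Toy comparison with synthetic scores standing in for base vs. RL-tuned models.
rng = np.random.default_rng(0)
base_scores = rng.normal(scale=0.10, size=5000)
rl_scores = rng.normal(scale=0.15, size=5000)
print("base:", edge_pattern_stats(base_scores))
print("rl:  ", edge_pattern_stats(rl_scores))
```

Under this reading, the paper's two headline effects would appear as a larger strength value together with higher entropy and lower top-k mass for the RL-tuned model relative to its base checkpoint.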