🤖 AI Summary
This work addresses the limitations of traditional Proximal Policy Optimization (PPO) in large language model post-training, where reliance on all rollout data leaves training susceptible to noise and unfaithful reasoning samples, degrading performance and efficiency. To mitigate this, the authors propose Influence-guided PPO (I-PPO), a novel framework that, for the first time, integrates data attribution into the reinforcement learning post-training pipeline. I-PPO computes an influence score for each episode via a gradient-based approximation to dynamically identify and prioritize high-quality samples aligned with the validation objective. This approach not only introduces an intrinsic early-stopping mechanism that improves training efficiency but also significantly improves reasoning faithfulness. Experimental results demonstrate that I-PPO consistently outperforms both supervised fine-tuning (SFT) and standard PPO across multiple benchmarks, effectively reducing unfaithful Chain-of-Thought reasoning, accelerating convergence, and achieving superior final performance.
📝 Abstract
Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose \textbf{Influence-Guided PPO (I-PPO)}, a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, improving training efficiency while effectively reducing unfaithful CoT reasoning.
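The abstract does not spell out the scoring rule, but a common gradient-based influence approximation takes the inner product of each episode's loss gradient with a validation-set gradient and drops episodes whose score is negative (anti-aligned). A minimal sketch under that assumption, with flattened NumPy gradient vectors standing in for per-episode policy gradients (the function names and the zero threshold are illustrative, not the paper's):

```python
import numpy as np

def influence_scores(episode_grads, val_grad):
    """Influence of each episode, approximated as the dot product of its
    (flattened) gradient with the validation gradient. Positive scores mean
    the episode's update direction agrees with the validation objective."""
    return np.array([np.dot(g, val_grad) for g in episode_grads])

def filter_rollout_buffer(episodes, episode_grads, val_grad, threshold=0.0):
    """Keep only episodes aligned with the validation gradient.
    Episodes with score <= threshold (anti-aligned) are removed before the
    PPO update; an empty surviving buffer would signal early stopping."""
    scores = influence_scores(episode_grads, val_grad)
    kept = [ep for ep, s in zip(episodes, scores) if s > threshold]
    return kept, scores

# Toy usage: episode "b" points against the validation gradient and is dropped.
val_grad = np.array([1.0, 0.0])
grads = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.5, 0.5])]
kept, scores = filter_rollout_buffer(["a", "b", "c"], grads, val_grad)
```

In a real training loop the gradients would be per-episode policy-gradient estimates flattened across parameters, and the surviving subset would feed the standard PPO clipped-objective update.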