Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional Proximal Policy Optimization (PPO) in large language model post-training, where reliance on all rollout data renders training susceptible to noise and unfaithful reasoning samples, leading to degraded performance and inefficiency. To mitigate this, the authors propose Influence-guided PPO (I-PPO), a novel framework that integrates data attribution into the reinforcement learning post-training pipeline for the first time. I-PPO computes influence scores for each episode via gradient approximation to dynamically identify and prioritize high-quality samples aligned with the validation objective. This approach not only introduces an intrinsic early-stopping mechanism that enhances training efficiency but also significantly improves reasoning faithfulness. Experimental results demonstrate that I-PPO consistently outperforms both supervised fine-tuning (SFT) and standard PPO across multiple benchmarks, effectively reducing unfaithful Chain-of-Thought reasoning, accelerating convergence, and achieving superior final performance.
📝 Abstract
Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose **Influence-Guided PPO (I-PPO)**, a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training while effectively reducing unfaithful CoT reasoning.
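The scoring-and-filtering idea from the abstract can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes a first-order influence approximation where each episode's influence is the dot product of its (flattened) policy gradient with the validation-loss gradient, and episodes with negative scores are dropped. The names `episode_grads`, `val_grad`, and `filter_rollouts` are hypothetical.

```python
import numpy as np

def influence_scores(episode_grads: np.ndarray, val_grad: np.ndarray) -> np.ndarray:
    """First-order influence approximation: the dot product of each
    episode's flattened gradient with the validation-loss gradient.
    A negative score means the episode's update is anti-aligned with
    the validation objective."""
    return episode_grads @ val_grad

def filter_rollouts(episode_grads: np.ndarray, val_grad: np.ndarray):
    """Keep only episodes whose influence score is non-negative."""
    scores = influence_scores(episode_grads, val_grad)
    keep = scores >= 0.0
    return keep, scores

# Toy example: 3 episodes, each with a 4-dimensional gradient.
episode_grads = np.array([[ 1.0, 0.0, 0.0, 0.0],
                          [-1.0, 0.0, 0.0, 0.0],
                          [ 0.5, 0.5, 0.0, 0.0]])
val_grad = np.array([1.0, 0.0, 0.0, 0.0])

keep, scores = filter_rollouts(episode_grads, val_grad)
# Episode 2 is anti-aligned with the validation gradient and is dropped.
```

In a real PPO loop the surviving episodes would then form the buffer for the policy update; the abstract's "intrinsic early stopping" would correspond to the case where (nearly) all episodes are filtered out.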
Problem

Research questions and friction points this paper is trying to address.

data attribution
PPO
LLM post-training
unfaithful reasoning
rollout buffer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Influence-Guided PPO
data attribution
gradient-based filtering
unfaithful reasoning
RL post-training