🤖 AI Summary
This work addresses the limitations of traditional Proximal Policy Optimization (PPO) in large language model post-training, where reliance on all rollout data leaves training susceptible to noise and unfaithful reasoning samples, degrading performance and efficiency. To mitigate this, the authors propose Influence-guided PPO (I-PPO), a novel framework that, for the first time, integrates data attribution into the reinforcement learning post-training pipeline. I-PPO computes an influence score for each episode via a gradient-based approximation to dynamically identify and prioritize high-quality samples aligned with the validation objective. This approach not only introduces an intrinsic early-stopping mechanism that improves training efficiency but also significantly improves reasoning faithfulness. Experimental results demonstrate that I-PPO consistently outperforms both supervised fine-tuning (SFT) and standard PPO across multiple benchmarks, effectively reducing unfaithful Chain-of-Thought reasoning, accelerating convergence, and achieving superior final performance.
📝 Abstract
Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose \textbf{Influence-Guided PPO (I-PPO)}, a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, improving training efficiency while effectively reducing unfaithful CoT reasoning.
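The abstract does not spell out the scoring rule, but a common gradient-based influence approximation takes the inner product of each episode's loss gradient with a validation-set gradient and drops episodes whose score is negative (anti-aligned). A minimal sketch under that assumption, with flattened NumPy gradient vectors standing in for per-episode policy gradients (the function names and the zero threshold are illustrative, not the paper's):

```python
import numpy as np

def influence_scores(episode_grads, val_grad):
    """Influence of each episode, approximated as the dot product of its
    (flattened) gradient with the validation gradient. Positive scores mean
    the episode's update direction agrees with the validation objective."""
    return np.array([np.dot(g, val_grad) for g in episode_grads])

def filter_rollout_buffer(episodes, episode_grads, val_grad, threshold=0.0):
    """Keep only episodes aligned with the validation gradient.
    Episodes with score <= threshold (anti-aligned) are removed before the
    PPO update; an empty surviving buffer would signal early stopping."""
    scores = influence_scores(episode_grads, val_grad)
    kept = [ep for ep, s in zip(episodes, scores) if s > threshold]
    return kept, scores

# Toy usage: episode "b" points against the validation gradient and is dropped.
val_grad = np.array([1.0, 0.0])
grads = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.5, 0.5])]
kept, scores = filter_rollout_buffer(["a", "b", "c"], grads, val_grad)
```

In a real training loop the gradients would be per-episode policy-gradient estimates flattened across parameters, and the surviving subset would feed the standard PPO clipped-objective update.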