🤖 AI Summary
This work proposes a gradient-alignment-based data selection method for reinforcement learning with large language models, addressing the high sensitivity of RL training to data quality. Traditional human-curated or heuristic filtering approaches struggle to screen out inefficient or erroneous samples, because the utility of a sample shifts as the policy evolves during non-stationary optimization. The proposed method dynamically constructs an adaptive curriculum, using as its core selection criterion the alignment between each training problem's policy gradient and the gradient computed on a small, trusted validation set. This approach significantly outperforms existing baselines under challenging conditions such as unreliable rewards, distributional imbalance, and low-quality corpora, yielding more stable training and stronger final performance.
📝 Abstract
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpora. GradAlign consistently outperforms existing baselines in all three, yielding more stable training and improved final performance and underscoring the importance of directional gradient signals in navigating non-stationary policy optimization. We release our implementation at https://github.com/StigLidu/GradAlign.
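The core selection rule described above can be sketched compactly. The snippet below is an illustrative simplification, not the released implementation: it assumes per-problem policy gradients and a validation gradient are already available as flat vectors (function names `cosine_alignment` and `select_by_alignment` are ours), and ranks candidate training problems by cosine similarity to the trusted validation gradient.

```python
import numpy as np

def cosine_alignment(g: np.ndarray, g_val: np.ndarray) -> float:
    """Cosine similarity between a candidate's policy gradient and
    the gradient computed on the trusted validation set."""
    denom = np.linalg.norm(g) * np.linalg.norm(g_val)
    return float(np.dot(g, g_val) / denom) if denom > 0 else 0.0

def select_by_alignment(candidate_grads, val_grad, k):
    """Score each candidate problem by gradient alignment and keep
    the k best-aligned ones (an adaptive, per-step curriculum:
    scores change as the policy, and hence the gradients, evolve)."""
    scores = [cosine_alignment(g, val_grad) for g in candidate_grads]
    ranked = np.argsort(scores)[::-1]  # highest alignment first
    return ranked[:k].tolist(), scores

# Toy usage: the first candidate points with the validation gradient,
# the third points against it, so the first two are selected.
val_grad = np.array([1.0, 0.0])
cands = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
selected, scores = select_by_alignment(cands, val_grad, k=2)
```

In practice one would recompute these scores periodically during training, since the non-stationarity of RL means a problem's alignment score drifts as the policy updates.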