🤖 AI Summary
This work proposes a gradient-alignment-based data selection method for reinforcement learning with large language models, addressing the high sensitivity of RL training to data quality. Traditional human-curated or heuristic filtering approaches struggle to screen out inefficient or erroneous samples, because the utility of a sample shifts as the policy evolves during non-stationary optimization. The proposed method dynamically constructs an adaptive curriculum, using as its core selection criterion the alignment between each training problem's policy gradient and the gradient computed on a small, trusted validation set. This approach significantly outperforms existing baselines under challenging conditions such as unreliable rewards, distributional imbalance, and low-quality corpora, yielding more stable training and stronger final performance.
📝 Abstract
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpora. GradAlign consistently outperforms existing baselines in all three, yielding more stable training and improved final performance and underscoring the importance of directional gradient signals in navigating non-stationary policy optimization. We release our implementation at https://github.com/StigLidu/GradAlign.
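The core selection rule described above can be sketched compactly. The snippet below is an illustrative simplification, not the released implementation: it assumes per-problem policy gradients and a validation gradient are already available as flat vectors (function names `cosine_alignment` and `select_by_alignment` are ours), and ranks candidate training problems by cosine similarity to the trusted validation gradient.

```python
import numpy as np

def cosine_alignment(g: np.ndarray, g_val: np.ndarray) -> float:
    """Cosine similarity between a candidate's policy gradient and
    the gradient computed on the trusted validation set."""
    denom = np.linalg.norm(g) * np.linalg.norm(g_val)
    return float(np.dot(g, g_val) / denom) if denom > 0 else 0.0

def select_by_alignment(candidate_grads, val_grad, k):
    """Score each candidate problem by gradient alignment and keep
    the k best-aligned ones (an adaptive, per-step curriculum:
    scores change as the policy, and hence the gradients, evolve)."""
    scores = [cosine_alignment(g, val_grad) for g in candidate_grads]
    ranked = np.argsort(scores)[::-1]  # highest alignment first
    return ranked[:k].tolist(), scores

# Toy usage: the first candidate points with the validation gradient,
# the third points against it, so the first two are selected.
val_grad = np.array([1.0, 0.0])
cands = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
selected, scores = select_by_alignment(cands, val_grad, k=2)
```

In practice one would recompute these scores periodically during training, since the non-stationarity of RL means a problem's alignment score drifts as the policy updates.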