LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment

📅 2025-06-13

🤖 AI Summary

To address the data inefficiency of reinforcement learning (RL) for reasoning in large language models (LLMs), and the response-length bias that distorts gradient norms, this paper proposes LearnAlign, a gradient-alignment-based data selection method driven by success rate. The approach incorporates success rate, used as a measure of each data point's learnability, into the gradient alignment framework to counter response-length-induced distortion. The method is also shown to be effective in a staged RL setting. On GSM8K, it reaches 77.53% accuracy using only 1,000 samples, compared with 77.04% for full-dataset training, and across three mathematical reasoning benchmarks it substantially reduces data requirements with minor degradation or even improved performance.

📝 Abstract

Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named LearnAlign, which selects learnable and representative reasoning data for RL post-training. To overcome the well-known issue of response-length bias in gradient norms, we introduce a data learnability measure based on success rate, which indicates the learning potential of each data point. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements while incurring only minor performance degradation, or even improving performance, compared to full-data training. For example, it reduces data requirements by up to 1,000 data points with better performance (77.53%) than full-dataset training on the GSM8K benchmark (77.04%). Furthermore, we show its effectiveness in the staged RL setting. This work provides valuable insights into data-efficient RL post-training and establishes a foundation for future research in optimizing reasoning data selection. To facilitate future work, we will release code.

Problem

Research questions and friction points this paper is trying to address.

Addresses data inefficiency in RL for enhancing LLM reasoning

Overcomes response-length bias in gradient norms for data selection

Reduces training data needs while maintaining or improving performance
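
To make the response-length bias concrete, here is a toy sketch (not taken from the paper) of why raw gradient norms can favor long responses. It assumes a sequence-level gradient is the sum of per-token gradients, and that each token gradient is a shared direction plus Gaussian noise; the function names, dimensions, and noise scale are all hypothetical choices for illustration.

```python
import math
import random


def summed_gradient(num_tokens, dim=8, seed=0):
    """Sum toy per-token gradients: a shared signal on axis 0 plus noise.

    Assumption: the sequence gradient is an unnormalized sum over tokens.
    """
    rng = random.Random(seed)
    total = [0.0] * dim
    for _ in range(num_tokens):
        for d in range(dim):
            shared = 1.0 if d == 0 else 0.0  # common learning signal
            total[d] += shared + rng.gauss(0.0, 0.5)
    return total


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


# Two responses carrying the same underlying signal but different lengths.
short = summed_gradient(num_tokens=50, seed=1)
long_ = summed_gradient(num_tokens=400, seed=2)

print("norm (50 tokens): ", round(norm(short), 2))
print("norm (400 tokens):", round(norm(long_), 2))
print("cosine similarity:", round(cosine(short, long_), 3))
```

In this toy setting the longer response has a much larger gradient norm even though both responses point in nearly the same direction, so a norm-based selection score would rank by length rather than by usefulness. A direction-based quantity such as cosine similarity does not have this dependence on scale.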

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-alignment-based data selection method

Data learnability based on success rate

Reduces training data requirements significantly
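
The ideas listed above can be sketched as a small scoring routine. This is a minimal illustration under stated assumptions, not the paper's actual formula: the summary does not give the exact definitions, so the sketch assumes learnability is `p * (1 - p)` for success rate `p`, alignment is the mean cosine similarity between one sample's gradient and the others, and the final score is their product. The names `learnability`, `cosine`, and `select` are hypothetical.

```python
import math


def learnability(success_rate):
    """Assumed form: p * (1 - p). Zero for always-solved or never-solved items."""
    return success_rate * (1.0 - success_rate)


def cosine(u, v):
    """Cosine similarity; scale-invariant, so gradient magnitude is ignored."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def select(gradients, success_rates, k):
    """Rank samples by (assumed) learnability times mean alignment; keep top k."""
    n = len(gradients)
    scores = []
    for i in range(n):
        sims = [cosine(gradients[i], gradients[j]) for j in range(n) if j != i]
        alignment = sum(sims) / len(sims) if sims else 0.0
        scores.append(learnability(success_rates[i]) * alignment)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy data: sample 1 has a large gradient but is always solved (p = 1.0);
# sample 3 points against the rest of the set.
gradients = [[1.0, 0.0], [10.0, 1.0], [0.9, 0.2], [-1.0, 0.0]]
success_rates = [0.5, 1.0, 0.25, 0.5]

selected, scores = select(gradients, success_rates, k=2)
print("selected indices:", selected)
```

In this toy example, the always-solved sample scores zero despite its large gradient, and the sample whose gradient opposes the others gets a negative score, so neither is selected. In a real pipeline the gradients would come from the policy model and the success rates from sampled rollouts; both are stand-ins here.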

🔎 Similar Papers

Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment