Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard reinforcement learning struggles with heterogeneous, long-tailed task distributions in LLM reasoning post-training: uniform prompt sampling and fixed rollout counts allocate compute inefficiently and under-train hard examples. This work proposes a Group Distributionally Robust Optimization (GDRO) framework built around an online difficulty classifier that dynamically partitions prompts into difficulty groups. On top of it, two mechanisms adaptively rebalance training: Prompt-GDRO reweights the sampling distribution over groups, and Rollout-GDRO reallocates the rollout budget across them. By combining EMA-debiased multiplicative-weights multi-armed bandit sampling, shadow-price control, and a square-root optimal rollout strategy, the method achieves unbiased reweighting and a compute-neutral, self-evolving curriculum. Evaluated on DAPO 14.1k with Qwen3-Base models (1.7B–8B), Prompt-GDRO and Rollout-GDRO improve average pass@8 accuracy by 10.6% and 10.1%, respectively, over the GRPO baseline.
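The EMA-debiased multiplicative-weights bandit sampling mentioned above can be illustrated with a small sketch. This is not the paper's implementation: the function name, the inverse-propensity debiasing step, and the `lr`/`decay` constants are illustrative assumptions, and the group losses are supplied by the caller.

```python
import math
import random

def prompt_gdro_sampler(group_loss, num_groups=4, steps=2000,
                        lr=5.0, decay=0.99, seed=0):
    """Sketch of an EMA-debiased multiplicative-weights bandit over
    difficulty groups: sample one group per step, observe a noisy loss,
    debias it by the sampling probability, smooth with an EMA, and
    exponentiate the EMA into sampling weights so persistently hard
    groups are upweighted without frequency bias."""
    rng = random.Random(seed)
    ema = [0.0] * num_groups
    probs = [1.0 / num_groups] * num_groups
    for _ in range(steps):
        g = rng.choices(range(num_groups), weights=probs)[0]
        # inverse-propensity scaling keeps the per-group loss estimate
        # unbiased even though hard groups are sampled more often
        unbiased = group_loss(g) / (num_groups * probs[g])
        for i in range(num_groups):
            target = unbiased if i == g else 0.0
            ema[i] = decay * ema[i] + (1 - decay) * target
        # multiplicative-weights update: weight grows with estimated loss
        w = [math.exp(lr * e) for e in ema]
        z = sum(w)
        probs = [x / z for x in w]
    return probs
```

With deterministic per-group losses, the returned distribution concentrates on the hardest group while keeping nonzero mass everywhere, which is the qualitative behavior the summary describes.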

📝 Abstract
Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
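The square-root optimal rollout allocation described for Rollout-GDRO can be sketched directly; this is a minimal sketch under stated assumptions, with the paper's shadow-price controller replaced by a closed-form normalization, and the variance proxies and budget being hypothetical inputs.

```python
def sqrt_rollout_allocation(variance_proxy, mean_budget, min_rollouts=1):
    """Allocate per-group rollout counts proportional to the square root
    of each group's gradient-variance proxy, normalized so that the
    average rollout count equals `mean_budget` (compute-neutral).
    Assumes proxies are positive and the min-rollout clamp rarely binds."""
    roots = [v ** 0.5 for v in variance_proxy]
    total = sum(roots)
    n = len(roots)
    raw = [n * mean_budget * r / total for r in roots]
    # round down, then hand leftover rollouts to the largest remainders
    alloc = [max(min_rollouts, int(x)) for x in raw]
    leftover = n * mean_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order:
        if leftover <= 0:
            break
        alloc[i] += 1
        leftover -= 1
    return alloc
```

For example, with variance proxies `[1, 4, 9, 16]` and a mean budget of 8 rollouts per prompt, the high-variance (hard) group receives roughly four times the rollouts of the easy group while the total budget stays fixed at 32.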
Problem

Research questions and friction points this paper is trying to address.

Large Language Model
Reinforcement Learning
Heavy-tailed Data
Training Efficiency
Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Distributionally Robust Optimization
Dynamic Difficulty Grouping
Prompt-GDRO
Rollout-GDRO
Compute-Neutral Allocation
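A minimal version of the dynamic difficulty grouping listed above could look like the following. The bucket edges, EMA decay, and class name are illustrative assumptions, not the paper's specification.

```python
class OnlineDifficultyClassifier:
    """Assigns each prompt to a difficulty group from a running (EMA)
    estimate of its pass rate; because the estimate is updated online,
    the partition drifts as the policy improves, yielding the
    self-evolving curriculum behavior described in the summary."""

    def __init__(self, edges=(0.25, 0.5, 0.75), decay=0.8):
        self.edges = edges      # pass-rate boundaries between groups
        self.decay = decay      # EMA smoothing factor (illustrative)
        self.pass_rate = {}     # prompt id -> EMA of observed pass rate

    def update(self, prompt_id, successes, rollouts):
        """Fold one batch of rollout outcomes into the prompt's EMA."""
        rate = successes / rollouts
        prev = self.pass_rate.get(prompt_id, rate)
        self.pass_rate[prompt_id] = self.decay * prev + (1 - self.decay) * rate

    def group(self, prompt_id):
        """Group 0 is hardest (lowest pass rate); unseen prompts
        default to the hardest group."""
        r = self.pass_rate.get(prompt_id, 0.0)
        g = 0
        for e in self.edges:
            if r >= e:
                g += 1
        return g
```

A prompt the policy never solves stays in the hardest bucket, while one it always solves migrates to the easiest; either of the GDRO controllers can then act on these group labels.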