Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models

📅 2025-06-28
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address the prohibitive GPU memory overhead and limited scalability caused by large-scale response sampling in group-based RLHF fine-tuning (e.g., GRPO), this paper proposes an efficient training framework that decouples group size from memory consumption. The method introduces two key innovations: (1) micro-group sampling coupled with continuous interleaved generation to reduce instantaneous GPU memory peaks; and (2) a two-stage hybrid scheduling strategy that combines token-level sequence length prediction with FPTAS-based global grouping and Shortest-Job-First (SJF) runtime refill. Experiments demonstrate that, while preserving training stability, the approach reduces peak GPU memory usage by over 50% and improves throughput by more than 25%, substantially improving training efficiency and scalability under hardware-constrained settings.
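The micro-group idea in (1) can be illustrated with a minimal sketch. All names here are hypothetical, and `generate_batch` stands in for whatever decoding backend is used; the point is only that a group of G completions is collected in rounds of at most m concurrent decodes, so peak memory scales with m rather than G.

```python
from typing import Callable, List

def micro_group_sample(
    prompt: str,
    group_size: int,          # G: total completions needed per prompt
    micro_group_size: int,    # m: completions decoded concurrently
    generate_batch: Callable[[str, int], List[str]],
) -> List[str]:
    """Collect `group_size` completions in memory-bounded rounds of size <= m."""
    completions: List[str] = []
    while len(completions) < group_size:
        # Decode at most `micro_group_size` sequences at once.
        n = min(micro_group_size, group_size - len(completions))
        completions.extend(generate_batch(prompt, n))
    return completions
```

With G = 8 and m = 3 this issues rounds of 3, 3, and 2 concurrent decodes instead of one round of 8.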

📝 Abstract
Group-based reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have proven effective for fine-tuning large language models (LLMs) with human feedback. However, generating and storing multiple responses per prompt incurs substantial memory overhead, especially as the sample group size increases, limiting scalability under constrained hardware. We propose Infinite Sampling, a framework that enables efficient and stable GRPO training by decoupling group size from GPU memory usage. It consists of: (1) micro sampling groups that decompose large groups into memory-feasible rounds; (2) continuous sampling that interleaves generation across groups to improve utilization; and (3) a length-aware scheduler combining token-conditioned sequence length prediction with a two-stage plan: global grouping via an FPTAS and runtime refill via SJF. Experiments show that our micro sampling groups reduce peak memory usage by over 50% compared to full-group decoding (e.g., from 21.55 GB to 10.64 GB on Qwen3-1.7B). Building on this, Infinite Sampling improves throughput by over 25% compared to the naive micro sampling group method, reducing decoding steps while preserving full-length completions and the same memory footprint. Our hybrid scheduling ensures efficient and stable GRPO training with larger groups under realistic GPU memory constraints.
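The global grouping step in (3) assigns sequences to decode rounds so that rounds have balanced total length. As a simplified stand-in for the paper's FPTAS (not the actual algorithm), a longest-processing-time greedy conveys the idea: sort by predicted length and always place the next sequence into the currently lightest group. Names and setup here are assumptions.

```python
import heapq
from typing import List, Sequence

def balance_groups(pred_lengths: Sequence[int], num_groups: int) -> List[List[int]]:
    """LPT greedy: assign each sequence (longest predicted first) to the
    currently lightest group, keeping per-round decode lengths even."""
    # Min-heap of (current total predicted length, group index).
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups: List[List[int]] = [[] for _ in range(num_groups)]
    for idx in sorted(range(len(pred_lengths)), key=lambda i: -pred_lengths[i]):
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + pred_lengths[idx], g))
    return groups
```

For predicted lengths [9, 7, 6, 5, 4, 2] and two groups, the greedy yields loads of 16 and 17 against a total of 33, close to the optimal split that a true FPTAS would approximate within a chosen tolerance.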
Problem

Research questions and friction points this paper is trying to address.

Reducing memory overhead in grouped RL training for LLMs
Decoupling group size from GPU memory usage efficiently
Improving throughput and stability in GRPO training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Micro sampling groups reduce memory usage
Continuous sampling interleaves generation efficiently
Length-aware scheduler optimizes sequence grouping
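The SJF refill in the scheduler above can be motivated with a toy slot simulation (this is not the paper's scheduler; the helper and setup are assumptions): when a fixed number of decode slots serve jobs in some admission order, admitting shorter predicted jobs first lowers mean completion time, which frees slots for refill sooner.

```python
import heapq
from typing import Sequence

def avg_completion_time(lengths: Sequence[int], num_slots: int) -> float:
    """Simulate `num_slots` concurrent decode slots. Each job occupies a slot
    for `length` steps; jobs are admitted in the given order. Returns the
    mean completion step, a proxy for queueing delay."""
    slots = [0] * num_slots          # next-free step of each slot
    heapq.heapify(slots)
    total = 0
    for length in lengths:
        start = heapq.heappop(slots)  # earliest-free slot
        finish = start + length
        total += finish
        heapq.heappush(slots, finish)
    return total / len(lengths)
```

For predicted lengths [8, 1, 4, 2] on two slots, FIFO order gives a mean completion step of 5.25, while SJF order (sorted ascending) gives 4.5.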
Liangyu Wang
King Abdullah University of Science and Technology, Saudi Arabia
Huanyi Xie
King Abdullah University of Science and Technology, Saudi Arabia
Xinhai Wang
King Abdullah University of Science and Technology, Saudi Arabia
Tianjin Huang
Asst. Professor, CS@University of Exeter & Research Fellow, CS@TU/e
LLMs · Adversarial examples · Stable Training · Graph Neural Network · Sparse Training
Mengdi Li
King Abdullah University of Science and Technology
Reinforcement Learning · LLMs · Robotics
Di Wang
King Abdullah University of Science and Technology, Saudi Arabia