Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models

📅 2025-06-28
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address the prohibitive GPU memory overhead and limited scalability caused by large-scale response sampling in group-based RLHF fine-tuning (e.g., GRPO), this paper proposes an efficient training framework that decouples group size from memory consumption. The method introduces two key innovations: (1) micro-group sampling coupled with continuous interleaved generation to reduce instantaneous GPU memory peaks; and (2) a two-stage hybrid scheduling strategy that combines token-level sequence length prediction with FPTAS-based global grouping and Shortest-Job-First (SJF) runtime refill. Experiments demonstrate that, while preserving training stability, the approach reduces peak GPU memory usage by over 50% and improves throughput by more than 25%, substantially improving training efficiency and scalability under hardware-constrained settings.
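The micro-group idea in (1) can be illustrated with a minimal sketch. All names here are hypothetical, and `generate_batch` stands in for whatever decoding backend is used; the point is only that a group of G completions is collected in rounds of at most m concurrent decodes, so peak memory scales with m rather than G.

```python
from typing import Callable, List

def micro_group_sample(
    prompt: str,
    group_size: int,          # G: total completions needed per prompt
    micro_group_size: int,    # m: completions decoded concurrently
    generate_batch: Callable[[str, int], List[str]],
) -> List[str]:
    """Collect `group_size` completions in memory-bounded rounds of size <= m."""
    completions: List[str] = []
    while len(completions) < group_size:
        # Decode at most `micro_group_size` sequences at once.
        n = min(micro_group_size, group_size - len(completions))
        completions.extend(generate_batch(prompt, n))
    return completions
```

With G = 8 and m = 3 this issues rounds of 3, 3, and 2 concurrent decodes instead of one round of 8.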

📝 Abstract
Group-based reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have proven effective for fine-tuning large language models (LLMs) with human feedback. However, generating and storing multiple responses per prompt incurs substantial memory overhead, especially as the sample group size increases, limiting scalability under constrained hardware. We propose Infinite Sampling, a framework that enables efficient and stable GRPO training by decoupling group size from GPU memory usage. It consists of: (1) micro sampling groups that decompose large groups into memory-feasible rounds; (2) continuous sampling that interleaves generation across groups to improve utilization; and (3) a length-aware scheduler combining token-conditioned sequence length prediction with a two-stage plan: global grouping via an FPTAS and runtime refill via SJF. Experiments show that our micro sampling groups reduce peak memory usage by over 50% compared to full-group decoding (e.g., from 21.55 GB to 10.64 GB on Qwen3-1.7B). Building on this, Infinite Sampling improves throughput by over 25% compared to the naive micro sampling group method, reducing decoding steps while preserving full-length completions and the same memory footprint. Our hybrid scheduling ensures efficient and stable GRPO training with larger groups under realistic GPU memory constraints.
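The global grouping step in (3) assigns sequences to decode rounds so that rounds have balanced total length. As a simplified stand-in for the paper's FPTAS (not the actual algorithm), a longest-processing-time greedy conveys the idea: sort by predicted length and always place the next sequence into the currently lightest group. Names and setup here are assumptions.

```python
import heapq
from typing import List, Sequence

def balance_groups(pred_lengths: Sequence[int], num_groups: int) -> List[List[int]]:
    """LPT greedy: assign each sequence (longest predicted first) to the
    currently lightest group, keeping per-round decode lengths even."""
    # Min-heap of (current total predicted length, group index).
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups: List[List[int]] = [[] for _ in range(num_groups)]
    for idx in sorted(range(len(pred_lengths)), key=lambda i: -pred_lengths[i]):
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + pred_lengths[idx], g))
    return groups
```

For predicted lengths [9, 7, 6, 5, 4, 2] and two groups, the greedy yields loads of 16 and 17 against a total of 33, close to the optimal split that a true FPTAS would approximate within a chosen tolerance.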
Problem

Research questions and friction points this paper is trying to address.

Reducing memory overhead in grouped RL training for LLMs
Decoupling group size from GPU memory usage efficiently
Improving throughput and stability in GRPO training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Micro sampling groups reduce memory usage
Continuous sampling interleaves generation efficiently
Length-aware scheduler optimizes sequence grouping
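The SJF refill in the scheduler above can be motivated with a toy slot simulation (this is not the paper's scheduler; the helper and setup are assumptions): when a fixed number of decode slots serve jobs in some admission order, admitting shorter predicted jobs first lowers mean completion time, which frees slots for refill sooner.

```python
import heapq
from typing import Sequence

def avg_completion_time(lengths: Sequence[int], num_slots: int) -> float:
    """Simulate `num_slots` concurrent decode slots. Each job occupies a slot
    for `length` steps; jobs are admitted in the given order. Returns the
    mean completion step, a proxy for queueing delay."""
    slots = [0] * num_slots          # next-free step of each slot
    heapq.heapify(slots)
    total = 0
    for length in lengths:
        start = heapq.heappop(slots)  # earliest-free slot
        finish = start + length
        total += finish
        heapq.heappush(slots, finish)
    return total / len(lengths)
```

For predicted lengths [8, 1, 4, 2] on two slots, FIFO order gives a mean completion step of 5.25, while SJF order (sorted ascending) gives 4.5.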
Liangyu Wang
King Abdullah University of Science and Technology, Saudi Arabia
Huanyi Xie
King Abdullah University of Science and Technology, Saudi Arabia
Xinhai Wang
King Abdullah University of Science and Technology, Saudi Arabia
Tianjin Huang
Asst. Professor, CS@University of Exeter & Research Fellow, CS@TU/e
LLMs · Adversarial examples · Stable Training · Graph Neural Network · Sparse Training
Mengdi Li
King Abdullah University of Science and Technology
Reinforcement Learning · LLMs · Robotics
Di Wang
King Abdullah University of Science and Technology, Saudi Arabia