🤖 AI Summary
This work addresses the inefficiency of static GPU provisioning in agentic reinforcement learning, where long-tail rollouts leave fixed allocations either wasted on stragglers or short of capacity. The authors propose cooperative elasticity: opportunistically reusing idle GPU compute and memory in serving clusters to execute rollouts while preserving online service-level objectives (SLOs). The resulting system, ROSE, combines three components: an SLO-safe co-serving executor, a cross-cluster weight transfer engine that exploits weight shards and sparsity, and an elastic rollout scheduler. Across multiple model sizes and cluster scales, ROSE improves average end-to-end throughput by 1.20–3.31× over state-of-the-art resource-fixed and elastic baselines.
📝 Abstract
Agentic reinforcement learning (RL) has emerged as a key driver for improving the multi-step reasoning and tool-use capabilities of LLMs. However, its efficiency is bottlenecked by long-tail rollouts with multi-turn environment interactions, making static GPU provisioning a poor fit: overprovisioning wastes GPUs on stragglers, while underprovisioning increases contention and slows training.
We observe that production serving clusters routinely leave substantial GPU compute and memory headroom. Based on this observation, we argue for cooperative elasticity: opportunistically repurposing underutilized serving GPUs to execute rollouts. Realizing cooperative elasticity is non-trivial because it must preserve serving Service Level Objectives (SLOs) under bursty traffic and minimize communication overhead. To address these challenges, we present ROSE, a cooperative, resource-elastic post-training system that safely harvests idle compute and memory on serving GPUs to accelerate agentic RL rollouts. ROSE consists of three components: (1) an SLO-safe co-serving executor that improves rollout throughput while preserving serving SLOs through efficient GPU memory and compute sharing; (2) a cross-cluster weight transfer engine that leverages weight shards and sparsity for fast weight synchronization across clusters; and (3) an elastic rollout scheduler that dynamically provisions cooperative capacity and routes trajectory rollouts across dedicated rollout GPUs and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves average end-to-end throughput by 1.20–3.31× compared with state-of-the-art resource-fixed and elastic baselines.
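To make the idea behind component (2) concrete, the following is a minimal sketch of sparsity-aware weight synchronization over named shards. It is not ROSE's implementation: the abstract does not say what kind of sparsity is exploited, so the sketch assumes that only a small fraction of weights change between synchronization steps and that sending just those entries is what saves bandwidth. The function names (`encode_sparse_delta`, `apply_sparse_delta`, `sync_shards`) and the flat-list shard representation are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a shard is a flat list of floats, and a sparse
# delta is a list of (flat index, new value) pairs. ROSE's real format is not
# described in the abstract.
SparseDelta = List[Tuple[int, float]]


def encode_sparse_delta(old: List[float], new: List[float], tol: float = 0.0) -> SparseDelta:
    """Keep only the entries whose value changed by more than `tol`."""
    if len(old) != len(new):
        raise ValueError("shards must have the same length")
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if abs(n - o) > tol]


def apply_sparse_delta(shard: List[float], delta: SparseDelta) -> List[float]:
    """Return a copy of `shard` with the changed entries overwritten."""
    updated = list(shard)
    for index, value in delta:
        updated[index] = value
    return updated


def sync_shards(
    old_shards: Dict[str, List[float]],
    new_shards: Dict[str, List[float]],
) -> Dict[str, SparseDelta]:
    """Encode one delta per named shard so each shard can be sent independently."""
    return {
        name: encode_sparse_delta(old_shards[name], new_shards[name])
        for name in new_shards
    }
```

Under this assumption, the training side would call `sync_shards` on the previous and current weights, ship each per-shard delta to the serving cluster, and the receiver would call `apply_sparse_delta` on its local copy. An unchanged shard yields an empty delta and costs nothing to transfer.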