🤖 AI Summary
This work addresses the substantial memory overhead of key-value (KV) caching in large language models during reinforcement learning, which arises from long-sequence rollouts and hinders efficient training on resource-constrained hardware. To mitigate this, the authors propose a sparse rollout mechanism that integrates sparsity-aware rejection sampling with importance reweighting to effectively correct off-policy bias induced by KV cache compression. This approach ensures training stability while significantly reducing memory consumption. The method establishes an end-to-end sparse reinforcement learning framework that maintains competitive model performance despite drastically lower rollout memory usage and enhances robustness for deployment under sparse inference conditions.
📝 Abstract
Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, a framework that enables stable RL training under sparse rollouts. We show that the instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness under sparse inference deployment.
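To make the correction mechanism concrete, the sketch below illustrates the general idea of combining rejection sampling with importance reweighting when tokens are sampled under a sparse (KV-compressed) policy but optimized under the learner policy. This is an illustrative sketch only; the function name, thresholds, and the exact form of the correction are assumptions, not the paper's implementation.

```python
import numpy as np

def sparse_rollout_correction(logp_learner, logp_sampler, advantages,
                              reject_threshold=5.0, clip=10.0):
    """Illustrative off-policy correction for sparse rollouts.

    Tokens were sampled under a sparse (KV-compressed) sampler policy.
    Each token is reweighted toward the learner policy by the importance
    ratio pi_learner / pi_sampler; tokens whose ratio diverges beyond a
    threshold are rejected (masked to zero) to bound the variance.
    NOTE: all names and default thresholds here are hypothetical.
    """
    logp_learner = np.asarray(logp_learner, dtype=float)
    logp_sampler = np.asarray(logp_sampler, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratio = np.exp(logp_learner - logp_sampler)   # per-token importance weight
    keep = ratio < reject_threshold               # rejection-sampling mask
    weights = np.clip(ratio, 0.0, clip) * keep    # reweight the kept tokens
    return weights * advantages                   # corrected per-token objective

# Toy usage: three tokens with per-token log-probs and unit advantages.
obj = sparse_rollout_correction(
    logp_learner=np.array([-1.0, -0.5, -3.0]),
    logp_sampler=np.array([-1.1, -0.6, -0.2]),
    advantages=np.array([1.0, 1.0, 1.0]),
)
```

In this toy example the third token, which the sparse sampler over-samples relative to the learner, is down-weighted rather than rejected; rejection only triggers when the learner assigns much *more* mass than the sampler, where importance weights would otherwise explode.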