🤖 AI Summary
This work addresses the severe memory bottleneck that the KV cache imposes during the rollout phase of reinforcement learning (RL) fine-tuning on long-context tasks, where existing compression methods often introduce off-policy bias that is hard to correct. To tackle this challenge, the paper proposes Shadow Mask Distillation, a novel approach that, for the first time, integrates knowledge distillation into the joint optimization of KV cache compression and policy alignment. By using distillation to align policy behavior between sparse and dense context representations, the method mitigates the amplification of compression errors that typically destabilizes RL training. Compatible with mainstream RL algorithms such as PPO, GRPO, and Online DPO, Shadow Mask Distillation significantly reduces memory consumption while preserving policy alignment performance, and it avoids the high gradient variance and sample inefficiency of conventional reweighting-based approaches.
📝 Abstract
Reinforcement Learning (RL) has emerged as a crucial paradigm for unlocking the advanced reasoning capabilities of Large Language Models (LLMs), encompassing frameworks like RLHF and RLAIF. Regardless of the specific optimization algorithm (e.g., PPO, GRPO, or Online DPO), online RL inherently requires an exploratory trajectory generation (rollout) phase. However, for long-context reasoning tasks, this rollout phase imposes a severe "memory wall" due to the exorbitant Key-Value (KV) cache footprint. While applying KV cache compression during rollouts mitigates this memory overhead, it induces a critical off-policy bias. Although modern KV compression is often nearly lossless during standard inference, even minuscule approximation errors are drastically amplified by the inherent instability of RL optimization. Specifically, the sampler generates responses under a sparse context, whereas the learner updates parameters using the full, dense context. Existing statistical solutions, such as importance reweighting, struggle to correct this magnified bias, suffering from high gradient variance and severe sample inefficiency.
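To make the reweighting problem concrete, the following is a minimal sketch of a sequence-level importance weight between the dense-context learner and the sparse-context sampler. It is an illustration under stated assumptions, not the paper's method: the abstract does not say which form of importance reweighting it has in mind, and the function names, the sequence-level product form, and the toy Gaussian model of per-token log-ratios are all hypothetical choices made here.

```python
import math


def sequence_importance_weight(dense_logprobs, sparse_logprobs):
    """Sequence-level weight pi_dense(y | x) / pi_sparse(y | x).

    Both arguments are per-token log-probabilities of the SAME sampled
    tokens: one scored by the learner with the full (dense) context, one
    by the sampler with the compressed (sparse) KV cache.
    """
    if len(dense_logprobs) != len(sparse_logprobs):
        raise ValueError("log-probability lists must have the same length")
    # Sum log-ratios first, then exponentiate, to avoid underflow.
    log_ratio = sum(d - s for d, s in zip(dense_logprobs, sparse_logprobs))
    return math.exp(log_ratio)


def toy_weight_variance(num_tokens, per_token_std):
    """Variance of the sequence weight under an assumed toy model.

    Assumption: per-token log-ratios are i.i.d. Gaussian with standard
    deviation `per_token_std` and mean -per_token_std**2 / 2, so that the
    weight is log-normal with expectation 1. Its variance is then
    exp(num_tokens * per_token_std**2) - 1.
    """
    return math.expm1(num_tokens * per_token_std ** 2)


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


if __name__ == "__main__":
    # The same small per-token mismatch, at two rollout lengths.
    for length in (512, 8192):
        print(length, toy_weight_variance(length, per_token_std=0.02))
```

Under this toy model the variance grows exponentially with rollout length: a per-token log-ratio standard deviation of 0.02 gives a weight variance of roughly 0.23 at 512 tokens but roughly 25 at 8192 tokens. This is one way to see why a compression scheme that is nearly lossless per token can still yield high-variance gradient estimates and a small effective sample size on long rollouts.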