ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

📅 2026-05-27
🤖 AI Summary
This work addresses the challenge of limited context length in large language models (LLMs) when applied to multi-turn, complex agent tasks, where existing compression methods struggle to balance information retention and reasoning efficiency. The authors propose ZipRL, a novel framework that integrates non-uniform, multi-granularity adaptive compression with a Hindsight Response Replay (HRR) mechanism within a reinforcement learning paradigm. This approach effectively mitigates sparse reward problems and is theoretically shown to yield higher task utility than uniform compression strategies. ZipRL employs coarse-to-fine prompting for macro-level compression and incorporates HRR into the GRPO algorithm, enhanced by generalized advantage reshaping to refine training signals. Experiments demonstrate that ZipRL significantly outperforms state-of-the-art methods across five agent tasks, achieving performance gains of 27.9% and 34.7% on Qwen3-4B and Qwen3-8B, respectively, while maintaining efficiency and robustness in extreme 256-turn extrapolation scenarios.
πŸ“ Abstract
Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove that ZipRL achieves higher task-relevant utility than uniform compression methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Experiments on multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show that ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% on Qwen3-4B and Qwen3-8B, respectively, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.
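To make the sparse-reward issue concrete, the following minimal sketch computes the standard group-relative advantage used by GRPO: each sampled response's reward is standardized against the other responses in its group. This is only the generic GRPO normalization. The abstract does not specify how HRR or the generalized advantage reshaping works, so neither is implemented here. The function name `grpo_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (generic GRPO-style advantage).

    Assumption: population standard deviation with a small eps for stability;
    real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group in which every rollout failed: all advantages are zero,
# so the group contributes no gradient signal.
all_failed = [0.0, 0.0, 0.0, 0.0]
print(grpo_advantages(all_failed))  # [0.0, 0.0, 0.0, 0.0]

# A group with a single success: that rollout gets a positive advantage,
# and the failed rollouts get negative ones.
one_success = [1.0, 0.0, 0.0, 0.0]
print(grpo_advantages(one_success))
```

When every rollout in a group receives the same verifiable reward, the advantages collapse to zero and the group carries no learning signal. This is the kind of sparsity that a signal-densifying technique such as HRR is described as addressing.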
Problem

Research questions and friction points this paper is trying to address.

context compression
multi-turn agent tasks
reinforcement learning
token efficiency
information retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive context compression
reinforcement learning from verifiable rewards
hindsight response replay
multi-granularity compression
token efficiency