COOPO: Cyclic Offline-Online Policy Optimization Algorithm

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets three problems: distributional shift and capped performance in offline reinforcement learning, the high interaction cost of purely online methods, and the catastrophic forgetting and distribution drift that affect hybrid offline-to-online approaches. The authors propose a cyclic policy optimization framework that periodically alternates between KL-regularized, advantage-weighted offline policy updates and online optimization with any policy optimization method. The offline phase anchors the learned policy to the behavior data distribution, and the online phase provides exploration; returning to the dataset each cycle increases data reuse. On the D4RL benchmark, the approach is reported to reach higher returns with fewer online interactions than state-of-the-art hybrid methods, and to remain robust across different offline algorithms and online optimizers.

📝 Abstract

Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. Recent hybrid offline-to-online methods bridge these domains but suffer from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization method for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the number of online environment interactions. Theoretically, COOPO achieves better online sample efficiency than pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive experiments on the D4RL benchmark demonstrate that COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, and that it remains robust across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.
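The cycle described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: it assumes a three-armed bandit, a softmax policy, exponentiated-advantage weights exp(A / beta) for the offline phase (the form usually associated with KL-regularized policy improvement), and a plain REINFORCE update standing in for "any policy optimization method" in the online phase. All function names, the reward values, and the hyperparameters (`beta`, `lr`, `max_weight`, step counts) are hypothetical choices for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def offline_awr_step(logits, actions, rewards, beta=1.0, lr=0.5, max_weight=20.0):
    """One advantage-weighted update on logged (action, reward) pairs.

    Weights exp(A / beta) keep the update close to the dataset actions;
    beta is an assumed temperature, and the baseline is the dataset mean.
    """
    probs = softmax(logits)
    advantages = rewards - rewards.mean()
    weights = np.minimum(np.exp(advantages / beta), max_weight)
    one_hot = np.eye(len(logits))[actions]
    # Gradient of the weighted log-likelihood sum_i w_i * log pi(a_i).
    grad = (weights[:, None] * (one_hot - probs)).mean(axis=0)
    return logits + lr * grad


def online_pg_step(logits, reward_means, rng, batch=32, lr=0.5):
    """One REINFORCE step using fresh samples from the toy environment."""
    probs = softmax(logits)
    actions = rng.choice(len(logits), size=batch, p=probs)
    rewards = reward_means[actions] + 0.1 * rng.standard_normal(batch)
    one_hot = np.eye(len(logits))[actions]
    centered = rewards - rewards.mean()  # simple mean baseline
    grad = (centered[:, None] * (one_hot - probs)).mean(axis=0)
    return logits + lr * grad


def coopo_sketch(num_cycles=3, offline_steps=5, online_steps=5, seed=0):
    """Alternate offline and online phases; return final policy and phase log."""
    rng = np.random.default_rng(seed)
    reward_means = np.array([0.1, 0.5, 0.9])  # hypothetical toy bandit
    # Static dataset collected by a uniform behavior policy (assumption).
    data_actions = rng.integers(0, 3, size=200)
    data_rewards = reward_means[data_actions] + 0.1 * rng.standard_normal(200)

    logits = np.zeros(3)
    phases = []
    for _ in range(num_cycles):
        for _ in range(offline_steps):  # anchor to the dataset
            logits = offline_awr_step(logits, data_actions, data_rewards)
        phases.append("offline")
        for _ in range(online_steps):   # fine-tune with new interactions
            logits = online_pg_step(logits, reward_means, rng)
        phases.append("online")
    return softmax(logits), phases


if __name__ == "__main__":
    policy, phases = coopo_sketch()
    print(phases)
    print(policy)
```

The same static dataset is reused in every cycle, which is the data-reuse property the abstract emphasizes; a real instantiation would replace the bandit with an MDP, the mean baseline with a learned value function, and the REINFORCE step with the chosen online optimizer.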

Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning

distributional shift

catastrophic forgetting

online fine-tuning

hybrid RL

Innovation

Methods, ideas, or system contributions that make the work stand out.

cyclic training

offline-online reinforcement learning

distributional shift mitigation