Policy Expansion for Bridging Offline-to-Online Reinforcement Learning

📅 2023-02-02
🏛️ International Conference on Learning Representations
📈 Citations: 74
Influential: 10
🤖 AI Summary
Offline-pretrained policies often degrade rapidly and explore poorly during the early stages of online reinforcement learning. To address this, the paper proposes a policy expansion scheme: the offline policy is frozen and kept as one candidate in a policy set, which is then expanded with a learnable online policy responsible for further learning. The two policies are composed adaptively when interacting with the environment, so the useful behaviors of the offline policy are fully retained and continue to guide exploration, while the online policy incrementally acquires new behaviors. On multiple continuous-control benchmarks, the method improves sample efficiency and final performance, avoids the initial performance collapse seen with direct fine-tuning, and converges more stably than standard fine-tuning and policy distillation baselines.
📝 Abstract
Pre-training with offline data and online fine-tuning using reinforcement learning is a promising strategy for learning control policies by leveraging the best of both worlds in terms of sample efficiency and performance. One natural approach is to initialize the policy for online learning with the one trained offline. In this work, we introduce a policy expansion scheme for this task. After learning the offline policy, we use it as one candidate policy in a policy set. We then expand the policy set with another policy which will be responsible for further learning. The two policies will be composed in an adaptive manner for interacting with the environment. With this approach, the policy previously learned offline is fully retained during online learning, thus mitigating potential issues such as destroying the useful behaviors of the offline policy in the initial stage of online learning, while allowing the offline policy to participate in exploration naturally and adaptively. Moreover, new useful behaviors can potentially be captured by the newly added policy through learning. Experiments are conducted on a number of tasks and the results demonstrate the effectiveness of the proposed approach.
Problem

Research questions and friction points this paper is trying to address.

Bridging offline pre-training and online fine-tuning in reinforcement learning
Retaining useful offline policy behaviors during online learning
Enabling adaptive policy composition for improved exploration and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy expansion scheme with offline policy retention
Adaptive composition of multiple policies for interaction
Offline policy as candidate enabling natural exploration
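The core idea above can be sketched in a few lines: a frozen offline policy and a learnable online policy each propose an action, and one proposal is selected adaptively, for example by a softmax over the critic's Q-values. This is a minimal toy sketch of that composition step, not the paper's implementation; the policy, critic, and temperature definitions below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper these would be the frozen
# offline policy, the learnable online policy, and a learned Q-function.
def offline_policy(state):
    return np.clip(0.5 * state, -1.0, 1.0)        # frozen behavioral prior

def online_policy(state):
    return np.clip(0.5 * state + 0.1, -1.0, 1.0)  # still being trained

def q_value(state, action):
    return -float(np.sum((action - 0.2 * state) ** 2))  # toy critic

def compose_action(state, temperature=1.0):
    """Adaptive composition over the policy set: each candidate proposes
    an action, and one proposal is sampled with probability proportional
    to the softmax of its Q-value (one plausible reading of the paper's
    adaptive composition; exact details may differ)."""
    proposals = [offline_policy(state), online_policy(state)]
    qs = np.array([q_value(state, a) for a in proposals])
    probs = np.exp((qs - qs.max()) / temperature)  # stabilized softmax
    probs /= probs.sum()
    idx = rng.choice(len(proposals), p=probs)
    return proposals[idx], probs

state = np.array([0.4, -0.3])
action, probs = compose_action(state)
print(probs)  # selection probabilities over the policy set, summing to 1
```

Because the offline policy is never updated, its contribution can only be down-weighted by the critic, never destroyed, while the online policy is free to absorb new behaviors as its Q-values improve.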