🤖 AI Summary
Online fine-tuning of offline-learned policies often suffers from significant early performance degradation, primarily because premature exploration overrides the initial policy. This work proposes a progressive exploration mechanism that accelerates environmental adaptation while preserving the stability of the initial policy. Our core contributions are: (i) the first dynamic exploration-gating scheme guided by online performance estimation, ensuring monotonic performance improvement throughout fine-tuning and avoiding the performance valley inherent in conventional methods; and (ii) integration with the Jump Start framework, unifying online performance evaluation, confidence-bound-guided exploration scheduling, and policy interpolation. Evaluated across diverse control tasks, our approach reduces the average performance drop by 87%, accelerates convergence by 3.2×, and eliminates persistent degradation in all tasks.
📝 Abstract
Fine-tuning policies learned offline remains a major challenge in applied domains. Monotonic performance improvement during *fine-tuning* is often difficult to achieve, as agents typically experience performance degradation in the early fine-tuning stage. The community has identified multiple difficulties in fine-tuning a learned network online; however, the majority of progress has focused on improving learning efficiency during fine-tuning. In practice, this comes at a serious cost: initially, agent performance degrades as the agent explores and effectively overrides the policy learned offline. We show that, across a range of settings, many offline-to-online algorithms exhibit either (1) performance degradation or (2) slow learning (sometimes effectively no improvement) during fine-tuning. We introduce a new fine-tuning algorithm, based on the Jump Start algorithm, that gradually allows more exploration based on online estimates of performance. Empirically, this approach achieves fast fine-tuning and significantly reduces performance degradation compared with existing algorithms designed to do the same.
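The core mechanism described above — following the offline guide policy for an initial horizon and shrinking that horizon only when an online confidence bound on performance clears the offline baseline — can be sketched as follows. This is a minimal illustration under assumptions of ours: the function names (`lcb`, `update_guide_horizon`, `act`), the mean-minus-standard-error confidence bound, and the fixed shrink step are all hypothetical placeholders, not the paper's exact estimator or schedule.

```python
def lcb(returns, beta=1.0):
    """Lower confidence bound on mean episodic return.
    Mean minus beta * standard error (an assumed estimator;
    the paper's exact bound is not specified here)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / max(n - 1, 1)
    return mean - beta * (var / n) ** 0.5

def update_guide_horizon(h, returns, offline_baseline, step=10):
    """Shrink the guide-policy horizon h (i.e. allow more exploration)
    only once the online LCB of performance clears the offline baseline;
    otherwise keep the guide policy in control, avoiding the early
    performance valley."""
    if len(returns) >= 5 and lcb(returns) >= offline_baseline:
        h = max(h - step, 0)
    return h

def act(t, h, guide_policy, explore_policy, state):
    """Jump Start-style rollout: follow the offline guide policy for the
    first h steps of an episode, then hand control to the fine-tuning
    (exploration) policy."""
    return guide_policy(state) if t < h else explore_policy(state)
```

The gating direction is the key design point: exploration is only ever unlocked after measured online performance justifies it, so the schedule degrades gracefully to "pure guide policy" when estimates are poor.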