Online Pre-Training for Offline-to-Online Reinforcement Learning

📅 2025-07-11

📈 Citations: 0

✨ Influential: 0

career value

224K/year

🤖 AI Summary

Offline pre-trained agents often suffer from value estimation bias during online fine-tuning due to distributional shift, frequently underperforming even randomly initialized policies. To address this, we propose “online pre-training”—a novel paradigm that inserts a lightweight, transfer-oriented value function adaptation phase between offline pre-training and online fine-tuning, explicitly optimizing the value function’s generalization to the online policy’s state-action distribution. Building upon TD3 and SPOT, our method implements staged value learning: first modeling broad behavioral priors offline, then calibrating Q-value estimates under the target online distribution via online pre-training. Evaluated on the D4RL benchmark (MuJoCo, Antmaze, Adroit), our approach achieves an average 30% performance gain over standard offline pre-training baselines. It is the first systematic solution to the offline-to-online value function transfer mismatch problem, significantly advancing robust policy transfer in offline-to-online reinforcement learning.

Technology Category

Application Category

📝 Abstract

Offline-to-online reinforcement learning (RL) aims to integrate the complementary strengths of offline and online RL by pre-training an agent offline and subsequently fine-tuning it through online interactions. However, recent studies reveal that offline pre-trained agents often underperform during online fine-tuning due to inaccurate value estimation caused by distribution shift, with random initialization proving more effective in certain cases. In this work, we propose a novel method, Online Pre-Training for Offline-to-Online RL (OPT), explicitly designed to address the issue of inaccurate value estimation in offline pre-trained agents. OPT introduces a new learning phase, Online Pre-Training, which allows the training of a new value function tailored specifically for effective online fine-tuning. Implementation of OPT on TD3 and SPOT demonstrates an average 30% improvement in performance across a wide range of D4RL environments, including MuJoCo, Antmaze, and Adroit.

Problem

Research questions and friction points this paper is trying to address.

Addresses inaccurate value estimation in offline pre-trained RL agents

Introduces Online Pre-Training to improve online fine-tuning performance

Enhances RL agent adaptability across diverse D4RL benchmark environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Online Pre-Training phase for RL

Tailors value function for online fine-tuning

Improves performance by 30% on D4RL

🔎 Similar Papers

No similar papers found.