🤖 AI Summary
Existing offline-to-online reinforcement learning (O2O RL) methods rely on the original offline dataset to mitigate out-of-distribution (OOD) issues, resulting in low online sample efficiency. To address this, we propose a state-action-conditional offline model guidance mechanism: the offline critic network is frozen, while learnable state-action-adaptive weighting coefficients are introduced to enable compact, data-free transfer of offline knowledge. Theoretical analysis establishes tighter convergence bounds and reduced Q-value estimation error for our method. Evaluated on the D4RL benchmark, our approach outperforms existing state-of-the-art (SOTA) methods, achieving substantial improvements in both sample efficiency and final policy performance.
📝 Abstract
Offline-to-online (O2O) reinforcement learning (RL) pre-trains models on offline data and refines policies through online fine-tuning. However, existing O2O RL algorithms typically require retaining the offline datasets during online training to mitigate the effects of out-of-distribution (OOD) data, which significantly limits their efficiency in exploiting online samples. To address this deficiency, we introduce a new paradigm for O2O RL called State-Action-Conditional Offline Model Guidance (SAMG). It freezes the pre-trained offline critic to provide a compact offline value estimate for each state-action sample, thus eliminating the need for retraining on offline data. The frozen offline critic is combined with the online target critic via a state-action-adaptive weighting coefficient. This coefficient aims to capture the degree to which each sample is covered by the offline data at the state-action level, and is updated adaptively during training. In practice, SAMG can be easily integrated with Q-function-based algorithms. Theoretical analysis establishes optimality guarantees and a lower Q-value estimation error. Empirically, SAMG outperforms state-of-the-art O2O RL algorithms on the D4RL benchmark.
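To make the guidance mechanism above concrete, the sketch below shows, in PyTorch-style code, how a frozen offline critic could be blended with the online target critic through a state-action-adaptive weight when forming a TD target. All names (`q_offline`, `q_target`, `c_net`, `policy`) are hypothetical placeholders, not the paper's implementation; the update rule for the weighting coefficient itself is omitted.

```python
import torch

# Hypothetical sketch of a SAMG-style blended TD target: a frozen critic
# pre-trained offline guides the online bootstrapped value via a learnable
# state-action-adaptive coefficient. Illustrative only, not the authors' code.

def blended_td_target(batch, q_offline, q_target, policy, c_net, gamma=0.99):
    """Compute a TD target that mixes offline and online value estimates.

    q_offline : critic pre-trained on offline data, kept frozen (no gradients)
    q_target  : target network of the online critic
    c_net     : small network producing a per-(s, a) weight in [0, 1]
              (trained separately with its own objective, omitted here)
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = policy(s_next)                    # next action from current policy
        q_on = q_target(s_next, a_next)            # online bootstrapped value
        q_off = q_offline(s_next, a_next)          # frozen offline value estimate
        w = torch.sigmoid(c_net(s_next, a_next))   # state-action-adaptive weight
        blended = w * q_off + (1.0 - w) * q_on     # offline-guided target value
        target = r + gamma * (1.0 - done) * blended
    return target
```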