SAMG: Offline-to-Online Reinforcement Learning via State-Action-Conditional Offline Model Guidance

📅 2024-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing offline-to-online reinforcement learning (O2O RL) methods rely on the original offline dataset to mitigate out-of-distribution (OOD) issues, resulting in low online sampling efficiency. To address this, we propose a state-action-conditional offline model guidance mechanism: the offline critic network is frozen, while learnable state-action-adaptive weighting coefficients are introduced to enable compact, data-free transfer of offline knowledge. Theoretical analysis establishes tighter convergence bounds and reduced Q-value estimation error for our method. Evaluated on the D4RL benchmark, our approach significantly outperforms existing state-of-the-art (SOTA) methods, achieving substantial improvements in both sample efficiency and final policy performance.

📝 Abstract
Offline-to-online (O2O) reinforcement learning (RL) pre-trains models on offline data and refines policies through online fine-tuning. However, existing O2O RL algorithms typically require retaining the cumbersome offline datasets to mitigate the effects of out-of-distribution (OOD) data, which significantly limits their efficiency in exploiting online samples. To address this deficiency, we introduce a new paradigm for O2O RL called State-Action-Conditional Offline Model Guidance (SAMG). It freezes the pre-trained offline critic to provide a compact offline understanding of each state-action sample, thus eliminating the need for retraining on offline data. The frozen offline critic is combined with the online target critic, weighted by a state-action-adaptive coefficient. This coefficient aims to capture the offline degree of samples at the state-action level and is updated adaptively during training. In practice, SAMG can be easily integrated with Q-function-based algorithms. Theoretical analysis shows good optimality and a lower estimation error. Empirically, SAMG outperforms state-of-the-art O2O RL algorithms on the D4RL benchmark.
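To make the mechanism concrete, below is a minimal PyTorch-style sketch of how a frozen offline critic could be blended with the online target critic through a state-action-conditional coefficient when forming the TD target. This is not the authors' released code: the coefficient network `AdaptiveCoefficient`, its sigmoid-MLP architecture, and the helper `blended_td_target` are illustrative assumptions, since the abstract does not specify these details.

```python
import torch
import torch.nn as nn


class AdaptiveCoefficient(nn.Module):
    """Hypothetical coefficient network: maps (s, a) to a weight in [0, 1]
    intended to reflect how 'offline' a given state-action sample is."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def blended_td_target(reward, next_state, next_action, done,
                      offline_critic, online_target_critic, coef_net,
                      gamma=0.99):
    """Form a TD target that mixes the frozen offline critic with the
    online target critic, weighted per state-action pair."""
    with torch.no_grad():
        q_offline = offline_critic(next_state, next_action)       # frozen, never retrained
        q_online = online_target_critic(next_state, next_action)  # standard target network
        alpha = coef_net(next_state, next_action)                  # offline-ness weight in [0, 1]
        q_next = alpha * q_offline + (1.0 - alpha) * q_online
        return reward + gamma * (1.0 - done) * q_next
```

In the paper the coefficient is updated adaptively during training; the sigmoid-bounded MLP above is only a placeholder for whatever adaptive objective SAMG actually prescribes.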
Problem

Research questions and friction points this paper is trying to address.

Low efficiency in exploiting online samples
Reliance on retraining with the original offline dataset
High Q-value estimation error in existing O2O RL algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Freezes the pre-trained offline critic to transfer offline knowledge without the dataset
Weights the frozen offline critic against the online target critic with a state-action-adaptive coefficient
Integrates readily with Q-function-based algorithms (see the sketch after this list)
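As a rough illustration of the last point, a Q-function-based agent could consume the blended target in its usual critic loss during online fine-tuning. The snippet below continues the hypothetical sketch above (`online_critic`, `offline_critic`, `coef_net`, `blended_td_target` are assumed names, not the paper's API) and stands in for one step of a TD3/SAC-style update loop.

```python
import torch.nn.functional as F

# Hypothetical critic update inside an online fine-tuning loop,
# reusing blended_td_target from the sketch above; state, action,
# reward, next_state, next_action, and done come from the replay batch.
q_pred = online_critic(state, action)
target = blended_td_target(reward, next_state, next_action, done,
                           offline_critic, online_target_critic, coef_net)
critic_loss = F.mse_loss(q_pred, target)

critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
```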
🔎 Similar Papers
2024-05-23 · Trans. Mach. Learn. Res. · Citations: 0
Liyu Zhang (Zhejiang University)
Haochi Wu (Zhejiang University)
Xu Wan (Zhejiang University) [Reinforcement Learning, Large Language Model, Large-scale Application]
Quan Kong (Zhejiang University)
Ruilong Deng (Professor, Zhejiang University) [Smart Grid, Cyber Security, Control Systems]
Mingyang Sun (Peking University)