When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning

📅 2022-06-27
🏛️ Neural Information Processing Systems
📈 Citations: 59
Influential: 10
🤖 AI Summary
To address policy transfer failures caused by insufficient state-action coverage in offline reinforcement learning (RL) data and by dynamics discrepancies between simulation and reality, this paper proposes a hybrid RL framework that integrates limited real-world offline data with online exploration in an imperfect simulator. Its core innovation is a dynamics-aware adaptive penalty on the Q-function: it estimates the dynamics gap of simulated transitions (e.g., via model residuals or contrastive encodings) and suppresses learning from high-bias simulated samples, improving policy robustness. The framework combines BCQ-style offline policy constraints, SAC-style online policy optimization, and explicit modeling of the dynamics mismatch. Across multiple simulated and real-robot tasks, the method significantly outperforms pure offline, pure online, and cross-domain baselines; theoretically it yields a tighter upper bound on policy error, and empirically it achieves a 2.3× improvement in sample efficiency.
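The adaptive penalty described above can be illustrated with a minimal sketch. The gap estimate below uses the model-residual route mentioned in the summary: compare the simulator's next state against a learned model of real dynamics, normalize residuals into batch weights, and push down Q-targets on high-gap simulated transitions. The function names, the softmax-style normalization, and the `temperature`/`beta` coefficients are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dynamics_gap_weight(sim_next_state, real_model_pred, temperature=1.0):
    """Estimate how strongly each simulated transition disagrees with a
    model of real dynamics; larger residual -> larger penalty weight.
    Weights are normalized softmax-style over the batch."""
    residual = np.linalg.norm(sim_next_state - real_model_pred, axis=-1)
    w = np.exp(residual / temperature)
    return w / w.sum()

def penalized_q_target(rewards, next_q, gap_weights, gamma=0.99, beta=1.0):
    """Bellman target minus an adaptive penalty on simulated transitions:
    high-gap samples get their target pushed down more (hypothetical form)."""
    return rewards + gamma * next_q - beta * gap_weights * np.abs(next_q)
```

Transitions the simulator gets badly wrong thus contribute pessimistic targets, while near-real transitions are trained on almost unpenalized.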
📝 Abstract
Learning effective reinforcement learning (RL) policies to solve real-world complex tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility to learn policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline data with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.
Problem

Research questions and friction points this paper is trying to address.

Combining limited real data with imperfect simulators for reinforcement learning
Addressing sim-to-real gaps and insufficient offline data coverage
Developing dynamics-aware hybrid offline-online RL framework H2O
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines offline RL on real data with online training in an imperfect simulator
Uses a dynamics-aware policy evaluation scheme
Adaptively penalizes the Q-function on simulated transitions with large dynamics gaps
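The hybrid offline-and-online setup above implies mixed minibatches: part drawn from the fixed real-world dataset, part from ongoing simulator rollouts, with provenance tracked so the dynamics-gap penalty applies only to simulated samples. The sketch below shows one simple way to do that split; the 50/50 ratio and helper names are assumptions for illustration, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hybrid_batch(real_buffer, sim_buffer, batch_size=8, real_frac=0.5):
    """Draw a mixed minibatch: a fixed fraction from the real offline
    dataset, the remainder from online simulator rollouts."""
    n_real = int(batch_size * real_frac)
    real_idx = rng.choice(len(real_buffer), size=n_real, replace=False)
    sim_idx = rng.choice(len(sim_buffer), size=batch_size - n_real, replace=False)
    batch = [real_buffer[i] for i in real_idx] + [sim_buffer[i] for i in sim_idx]
    # Provenance flags: True marks simulated transitions, so the critic
    # can apply the dynamics-gap penalty only to those samples.
    sim_flags = [False] * n_real + [True] * (batch_size - n_real)
    return batch, sim_flags
```

Real transitions are trusted as-is, while simulated ones pass through the adaptive penalty, which is how the framework addresses both limited offline coverage and simulator bias at once.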
Haoyi Niu
UC Berkeley
🤖 Robotics · 🌏 Policy Transfer · 🧠 Reinforcement Learning · 🚗 Autonomous Driving
Shubham Sharma
Indian Institute of Technology, Bombay, India
Yiwen Qiu
Tsinghua University, Beijing, China
Ming Li
Shanghai Jiao Tong University, Shanghai, China
Guyue Zhou
Tsinghua University, Beijing, China
Jianming Hu
Penn State University
Virology · Molecular Biology
Xianyuan Zhan
Associate Professor, Institute for AI Industry Research (AIR), Tsinghua University
Data-driven Decision-making · Real-world RL/IL · Embodied AI · Autonomous Driving