🤖 AI Summary
This paper addresses the challenge of dynamics mismatch between source and target domains in offline reinforcement learning. The authors propose MOBODY, a cross-domain representation-driven policy transfer method. First, the approach constructs a shared latent-state representation across domains, enabling transferable modeling of the target-domain dynamics using source-domain knowledge. Second, it leverages this learned dynamics model to perform multi-step rollouts, generating synthetic data that alleviates severe data scarcity in the target domain. Third, it introduces a Q-weighted behavior cloning loss to guide policy learning toward high-value actions. Evaluated on the MuJoCo benchmark, the method substantially outperforms existing state-of-the-art approaches, particularly under large dynamics discrepancies and extremely sparse target-domain offline datasets. The results demonstrate significant improvements in both sample efficiency and asymptotic performance, validating the effectiveness of cross-domain representation learning for offline policy transfer.
📝 Abstract
We study the off-dynamics offline reinforcement learning problem, where the goal is to learn a policy from offline datasets collected from source and target domains with mismatched transition dynamics. Existing off-dynamics offline RL methods typically either filter source transitions that resemble those of the target domain or apply reward augmentation to source data; both are constrained by the limited transitions available from the target domain. As a result, the learned policy is unable to explore the target domain beyond the offline datasets. We propose MOBODY, a Model-Based Off-Dynamics offline RL algorithm that addresses this limitation by enabling exploration of the target domain via learned dynamics. MOBODY generates new synthetic transitions in the target domain through model rollouts, which are used as data augmentation during offline policy learning. Unlike existing model-based methods that learn dynamics from a single domain, MOBODY tackles the challenge of mismatched dynamics by leveraging both source and target datasets. Directly merging these datasets can bias the learned model toward source dynamics. Instead, MOBODY learns target dynamics by discovering a shared latent representation of states and transitions across domains through representation learning. To stabilize training, MOBODY incorporates a behavior cloning loss that regularizes the policy. Specifically, we introduce a Q-weighted behavior cloning loss that regularizes the policy toward actions with high target-domain Q-values, rather than uniformly imitating all actions in the dataset. These Q-values are learned from an enhanced target dataset composed of offline target data, augmented source data, and rollout data from the learned target dynamics. We evaluate MOBODY on MuJoCo benchmarks and show that it significantly outperforms state-of-the-art baselines, with especially pronounced improvements in challenging scenarios.
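To make the Q-weighted behavior cloning idea concrete, here is a minimal NumPy sketch. The exact weighting scheme is not given in the abstract, so the exponential (advantage-weighted-regression-style) weighting, the `temperature` parameter, and the function name `q_weighted_bc_loss` are all illustrative assumptions, not the paper's definitive formulation.

```python
import numpy as np

def q_weighted_bc_loss(policy_actions, dataset_actions, q_values, temperature=1.0):
    """Illustrative Q-weighted behavior cloning loss.

    Rather than imitating every dataset action uniformly, each squared-error
    term is weighted by a normalized exponential of its target-domain Q-value,
    so the regularizer pulls the policy toward high-value actions.
    """
    # Assumed form: softmax-style weights over Q-values (subtract max for
    # numerical stability); the paper may use a different weighting.
    w = np.exp((q_values - q_values.max()) / temperature)
    w = w / w.sum()
    # Per-transition squared error between policy output and dataset action.
    per_sample = np.sum((policy_actions - dataset_actions) ** 2, axis=-1)
    return float(np.sum(w * per_sample))

# Toy check: the same action mismatch costs more when its Q-value is high.
policy_a = np.zeros((2, 1))
data_a = np.array([[1.0], [0.0]])          # only the first action mismatches
loss_high_q = q_weighted_bc_loss(policy_a, data_a, np.array([10.0, 0.0]))
loss_low_q = q_weighted_bc_loss(policy_a, data_a, np.array([0.0, 10.0]))
```

With the mismatch on a high-Q action, the loss is close to 1; with the mismatch on a low-Q action, it is close to 0, which is exactly the "imitate high-value actions" behavior the abstract describes.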