AI Summary
To address the challenge of policy learning under scarce target-task samples in high-dimensional, non-stationary, finite-horizon MDPs, this paper proposes an offline transfer reinforcement learning framework tailored to Q-function learning. Methodologically, it introduces a novel re-targeting mechanism that enables vertical, multi-step cascading of transferred information; integrates heterogeneous offline source datasets; supports both batch and online learning modes; and models the source-target relationship via a similarity assumption. Theoretically, it establishes the first rigorous guarantees of accelerated Q-function convergence rates under offline transfer, as well as reduced regret bounds in the offline-to-online transfer setting. Empirical evaluations on synthetic benchmarks and real-world clinical datasets demonstrate that the proposed method significantly outperforms standard Q-learning and state-of-the-art transfer RL baselines.
Abstract
We consider $Q$-learning with knowledge transfer, using samples from a target reinforcement learning (RL) task as well as source samples from different but related RL tasks. We propose transfer learning algorithms for both batch and online $Q$-learning with offline source studies. The proposed transferred $Q$-learning algorithm contains a novel re-targeting step that enables vertical information cascading along the multiple steps of an RL task, in addition to the usual horizontal information gathering of transfer learning (TL) for supervised learning. We establish the first theoretical justifications of TL in RL tasks by showing a faster convergence rate of $Q$-function estimation in offline RL transfer and a lower regret bound in offline-to-online RL transfer under certain similarity assumptions. Empirical evidence from both synthetic and real datasets is presented to support the proposed algorithms and our theoretical results.
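For intuition, the sketch below shows one way the transferred $Q$-learning idea could look in code: a finite-horizon fitted-$Q$ backward induction in which, at every step, the regression labels of both target and source samples are re-computed ("re-targeted") using the transferred next-step $Q$ estimate, a pooled least-squares fit aggregates the two data sources, and a target-only correction adjusts for source-target bias. This is a minimal illustration under assumed linear $Q$-functions; the featurization, the ridge estimator, the pooled-then-correct structure, and all function names are illustrative assumptions, not the paper's exact algorithm.

```python
# A minimal sketch (not the paper's exact method): transferred fitted-Q
# iteration for a finite-horizon MDP with linear Q-functions. Source and
# target batches are pooled at each step, followed by a target-only
# correction. The "re-targeting" idea is mimicked by recomputing every
# regression label -- including those of the source samples -- with the
# transferred estimate of the next-step Q-function, so information
# cascades backward (vertically) through the horizon.
import numpy as np

def features(states, actions, n_actions):
    """One-hot action interacted with state features (toy featurization)."""
    n, d = states.shape
    phi = np.zeros((n, d * n_actions))
    for i, a in enumerate(actions):
        phi[i, a * d:(a + 1) * d] = states[i]
    return phi

def q_values(w, states, n_actions):
    """Q(s, a) for every action under linear weights w."""
    d = states.shape[1]
    return np.stack(
        [states @ w[a * d:(a + 1) * d] for a in range(n_actions)], axis=1
    )

def ridge(X, y, lam=1e-3):
    """Ridge-regularized least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def transferred_fitted_q(target_batches, source_batches, n_actions, horizon):
    """target_batches[t] / source_batches[t]: dicts with keys s, a, r, s_next."""
    d = target_batches[0]["s"].shape[1]
    w = [np.zeros(d * n_actions) for _ in range(horizon + 1)]  # w[horizon] = 0
    for t in reversed(range(horizon)):
        def labels(batch):
            # Re-targeting: labels use the *transferred* next-step Q estimate.
            v_next = q_values(w[t + 1], batch["s_next"], n_actions).max(axis=1)
            return batch["r"] + v_next
        tgt, src = target_batches[t], source_batches[t]
        X_tgt = features(tgt["s"], tgt["a"], n_actions)
        X_src = features(src["s"], src["a"], n_actions)
        # Step 1: pooled estimate on target + source samples (horizontal transfer).
        X_pool = np.vstack([X_tgt, X_src])
        y_pool = np.concatenate([labels(tgt), labels(src)])
        w_pool = ridge(X_pool, y_pool)
        # Step 2: target-only correction of the source-induced bias.
        delta = ridge(X_tgt, labels(tgt) - X_tgt @ w_pool)
        w[t] = w_pool + delta
    return w[:horizon]
```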