AI Summary
This work addresses the challenge of effectively transferring reward functions learned from expert demonstrations in a source environment to a target reinforcement learning setting. To this end, the authors propose a coupled modeling approach that jointly constructs Bellman equations for both source and target domains, enabling a minimax estimation framework to directly solve for the soft Q-functions in both domains simultaneously. This circumvents the error propagation inherent in sequential estimation procedures. Theoretical analysis demonstrates that the method eliminates the first-order influence of source-domain Bellman residuals on the target policy and establishes finite-sample error bounds for the soft Q-functions as well as policy regret bounds. Empirical evaluation on a sepsis simulator shows that the proposed method outperforms conventional sequential transfer strategies.
Abstract
We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are collected in a controlled environment. We formulate the problem as a joint system of Bellman equations across the source and target environments and develop minimax estimators for the target soft-$q$-function. Whereas a sequential solution approach first estimates the source reward and then plugs it into the target control problem, a coupled approach solves the source and target system of equations jointly. We show that, in contrast to the sequential approach, the coupled approach removes the first-order influence of source Bellman residual error. We characterize the local behavior of each approach, develop finite-sample soft-$q$-function error bounds, and prove regret guarantees for the resulting soft-control policy. An empirical investigation using a sepsis simulator validates the theoretical comparison.
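To make the sequential baseline concrete, the following is a minimal tabular sketch and not the authors' estimator. It assumes known transition matrices, a discount factor `gamma`, and a source soft-q-function that is given exactly rather than estimated from demonstrations. All names (`recover_reward`, `soft_value_iteration`, `P_src`, `P_tgt`) and the toy numbers are hypothetical. Step one inverts the source soft Bellman equation to obtain the reward; step two plugs that reward into soft value iteration under the target dynamics.

```python
import math

def soft_value(q_row):
    """Soft value of a state: log-sum-exp over its action values (stabilized)."""
    m = max(q_row)
    return m + math.log(sum(math.exp(x - m) for x in q_row))

def expected_next_value(P, v, s, a):
    """Expected soft value of the next state under P[s][a] (a probability list)."""
    return sum(p * v[s2] for s2, p in enumerate(P[s][a]))

def recover_reward(q_src, P_src, gamma):
    """Step 1: invert the source soft Bellman equation, r = q - gamma * E[v(s')]."""
    v = [soft_value(row) for row in q_src]
    return [[q_src[s][a] - gamma * expected_next_value(P_src, v, s, a)
             for a in range(len(q_src[s]))] for s in range(len(q_src))]

def soft_value_iteration(r, P, gamma, iters=500):
    """Step 2: iterate q <- r + gamma * E[v(s')] under the given dynamics."""
    q = [[0.0] * len(r[0]) for _ in r]
    for _ in range(iters):
        v = [soft_value(row) for row in q]
        q = [[r[s][a] + gamma * expected_next_value(P, v, s, a)
              for a in range(len(r[s]))] for s in range(len(r))]
    return q

# Toy problem (hypothetical numbers): 2 states, 2 actions.
gamma = 0.9
P_src = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]
P_tgt = [[[0.6, 0.4], [0.2, 0.8]], [[0.9, 0.1], [0.4, 0.6]]]
q_src = [[1.0, 0.5], [0.2, 0.8]]  # assumed known here, estimated in practice

r_hat = recover_reward(q_src, P_src, gamma)          # source stage
q_tgt = soft_value_iteration(r_hat, P_tgt, gamma)    # target stage
q_check = soft_value_iteration(r_hat, P_src, gamma)  # sanity check on source
```

Running the second step under the source dynamics (`q_check`) recovers `q_src`, which confirms that the two steps are mutually consistent. In this pipeline, any error in the source soft-q-function passes into `r_hat` and from there into `q_tgt`; this is the propagation that the coupled approach described in the abstract is designed to avoid.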