🤖 AI Summary
This paper addresses offline inverse reinforcement learning (IRL): recovering an interpretable reward function solely from static expert demonstrations, without online interaction. The authors propose an off-policy bilevel optimization framework in which the upper level performs maximum-likelihood reward estimation via soft value matching, while the lower level implicitly models expert behavior with Conservative Q-Learning (CQL), jointly optimizing the reward function and a conservative Q-function. They establish theoretical convergence to a reward function under which the expert policy is soft-optimal. Empirically, the method improves reward recovery accuracy and downstream policy performance on standard offline RL benchmarks, outperforming existing offline IRL and imitation learning approaches. The key innovation is eliminating explicit policy learning: an end-to-end bilevel structure enables more robust and precise reward inference.
📝 Abstract
Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bilevel framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.
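The alternation described in the abstract can be sketched in a tabular toy setting. Everything below is an illustrative assumption rather than the paper's implementation: the function names (`soft_q_iteration`, `reward_step`), the reduction of the CQL term to a fixed penalty subtracted from out-of-dataset actions, and the use of a soft-policy likelihood gradient as a stand-in for the exact maximum-likelihood reward update are all choices made for this sketch.

```python
import numpy as np

def soft_q_iteration(R, P, gamma=0.9, beta=1.0, alpha_cql=0.1,
                     data_mask=None, n_iters=200):
    """Inner level (sketch): soft Q-iteration under the current reward R,
    with a CQL-style penalty pushing Q down on out-of-dataset actions."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # numerically stable soft value V(s) = (1/beta) * logsumexp(beta * Q(s, .))
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(beta * (Q - m)).sum(axis=1, keepdims=True)) / beta).ravel()
        Q = R + gamma * P.reshape(S * A, S).dot(V).reshape(S, A)
        if data_mask is not None:
            Q = Q - alpha_cql * (1.0 - data_mask)  # conservative penalty
    return Q

def reward_step(R, Q, expert_sa, beta=1.0, lr=0.1):
    """Outer level (sketch): raise the reward on expert actions and lower it
    on actions the current soft policy prefers (a likelihood-style gradient)."""
    pi = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    grad = np.zeros_like(R)
    for s, a in expert_sa:
        grad[s, a] += 1.0   # expert's chosen action
        grad[s] -= pi[s]    # minus the model's action distribution
    return R + lr * grad / len(expert_sa)

# Tiny deterministic 2-state, 2-action MDP: action 0 stays, action 1 switches.
S, A = 2, 2
P = np.zeros((S, A, S))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
expert_sa = [(0, 0), (1, 0)] * 10        # expert always picks action 0
data_mask = np.zeros((S, A))
for s, a in expert_sa:
    data_mask[s, a] = 1.0

R = np.zeros((S, A))
for _ in range(50):                       # bilevel alternation
    Q = soft_q_iteration(R, P, data_mask=data_mask)
    R = reward_step(R, Q, expert_sa)
```

After the alternation, the learned reward assigns higher value to the expert's action in every visited state, illustrating (in miniature) the reward-recovery behavior the abstract claims.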