🤖 AI Summary
This paper addresses offline inverse reinforcement learning (IRL): recovering an interpretable reward function solely from static expert demonstrations, without online interaction. The authors propose an off-policy bilevel optimization framework in which the upper level performs maximum-likelihood reward estimation via soft value matching, while the lower level implicitly models expert behavior with Conservative Q-Learning (CQL), jointly optimizing the reward function and a conservative Q-function. They establish theoretical convergence to a reward function under which the expert policy is soft-optimal. Empirically, the method improves reward recovery accuracy and downstream policy performance on standard offline RL benchmarks, outperforming existing offline IRL and imitation learning approaches. The key innovation is eliminating explicit policy learning: an end-to-end bilevel structure enables more robust and precise reward inference.
📝 Abstract
Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bilevel framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.
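The alternation described in the abstract can be sketched in a tabular toy setting. Everything below is an illustrative assumption rather than the paper's implementation: the function names (`soft_q_iteration`, `reward_step`), the reduction of the CQL term to a fixed penalty subtracted from out-of-dataset actions, and the use of a soft-policy likelihood gradient as a stand-in for the exact maximum-likelihood reward update are all choices made for this sketch.

```python
import numpy as np

def soft_q_iteration(R, P, gamma=0.9, beta=1.0, alpha_cql=0.1,
                     data_mask=None, n_iters=200):
    """Inner level (sketch): soft Q-iteration under the current reward R,
    with a CQL-style penalty pushing Q down on out-of-dataset actions."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # numerically stable soft value V(s) = (1/beta) * logsumexp(beta * Q(s, .))
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(beta * (Q - m)).sum(axis=1, keepdims=True)) / beta).ravel()
        Q = R + gamma * P.reshape(S * A, S).dot(V).reshape(S, A)
        if data_mask is not None:
            Q = Q - alpha_cql * (1.0 - data_mask)  # conservative penalty
    return Q

def reward_step(R, Q, expert_sa, beta=1.0, lr=0.1):
    """Outer level (sketch): raise the reward on expert actions and lower it
    on actions the current soft policy prefers (a likelihood-style gradient)."""
    pi = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    grad = np.zeros_like(R)
    for s, a in expert_sa:
        grad[s, a] += 1.0   # expert's chosen action
        grad[s] -= pi[s]    # minus the model's action distribution
    return R + lr * grad / len(expert_sa)

# Tiny deterministic 2-state, 2-action MDP: action 0 stays, action 1 switches.
S, A = 2, 2
P = np.zeros((S, A, S))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
expert_sa = [(0, 0), (1, 0)] * 10        # expert always picks action 0
data_mask = np.zeros((S, A))
for s, a in expert_sa:
    data_mask[s, a] = 1.0

R = np.zeros((S, A))
for _ in range(50):                       # bilevel alternation
    Q = soft_q_iteration(R, P, data_mask=data_mask)
    R = reward_step(R, Q, expert_sa)
```

After the alternation, the learned reward assigns higher value to the expert's action in every visited state, illustrating (in miniature) the reward-recovery behavior the abstract claims.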