Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^\pi$-Realizable MDPs

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work studies offline imitation learning in $Q^\pi$-realizable MDPs: learning a well-performing policy solely from expert state-action demonstrations. We propose SPOIL (saddle-point offline imitation learning), the first algorithm in this setting achieving $\tilde{O}(\varepsilon^{-2})$ sample complexity. Leveraging saddle-point optimization and function-approximation theory, we provide rigorous guarantees: the learned policy's performance gap is bounded by $\varepsilon$ with high probability. The analysis also suggests a novel critic-network loss that substantially improves training stability in deep imitation learning. Empirically, the neural implementation of SPOIL significantly outperforms behavioral cloning and is competitive with state-of-the-art offline reinforcement learning methods on benchmark tasks.

📝 Abstract
We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.
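The abstract describes SPOIL only at a high level. As a rough illustration of a saddle-point imitation scheme with a linear critic, the toy sketch below trains a critic to score expert actions above the learner's current actions while the policy ascends the critic. All dimensions, feature maps, step sizes, and update rules here are our own illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

# Toy saddle-point imitation sketch (assumed details, not SPOIL's spec):
# a linear critic Q(s,a) = phi(s,a) . theta is pushed up on expert
# actions and down on the learner's action distribution; the softmax
# policy then ascends the critic's values.

rng = np.random.default_rng(0)
n_states, n_actions, d = 4, 2, 8
phi = rng.normal(size=(n_states, n_actions, d))  # state-action features

# Expert demonstrations: state-action pairs only, no rewards observed.
expert_sa = [(s, s % n_actions) for s in range(n_states)]

theta = np.zeros(d)                       # critic weights
logits = np.zeros((n_states, n_actions))  # softmax policy parameters

def policy_probs(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

eta_critic, eta_policy = 0.5, 1.0
for _ in range(200):
    pi = policy_probs(logits)
    # Critic ascent (max player): separate expert features from the
    # learner's expected features.
    grad = np.zeros(d)
    for s, a in expert_sa:
        grad += phi[s, a] - pi[s] @ phi[s]
    theta += eta_critic * grad / len(expert_sa)
    theta /= max(1.0, np.linalg.norm(theta))  # keep the critic bounded
    # Policy ascent (min player of the imitation gap): move probability
    # mass toward actions the critic currently rates highly.
    logits += eta_policy * (phi @ theta)

pi = policy_probs(logits)
avg_expert_prob = float(np.mean([pi[s, a] for s, a in expert_sa]))
print(f"average probability on expert actions: {avg_expert_prob:.2f}")
```

The point of the sketch is only that the critic's training signal comes entirely from expert state-action data, mirroring the loss-function idea the abstract attributes to the analysis; the learner's action probability on expert actions rises well above the uniform baseline as the two players reach an approximate equilibrium.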
Problem

Research questions and friction points this paper is trying to address.

Learning a well-performing policy offline from expert state-action data alone
Matching expert performance in $Q^\pi$-realizable MDPs
Stable critic training from expert data in deep imitation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

SPOIL: saddle-point algorithm for offline imitation learning
Sample-complexity guarantees for linear and non-linear $Q^\pi$-realizable MDPs
New loss function for training critic networks from expert data