🤖 AI Summary
This work addresses offline dynamic discrete choice (DDC) and offline maximum-entropy inverse reinforcement learning (MaxEnt-IRL), aiming to unbiasedly recover the reward function or optimal Q*-function from static behavioral data. We propose the first unified framework based on empirical risk minimization that neither assumes linear rewards nor requires knowledge of the state transition model. Theoretically, we prove that the Bellman residual satisfies the Polyak–Łojasiewicz condition, ensuring global and fast convergence of gradient-based optimization. The framework accommodates nonparametric function classes, including neural networks. Extensive experiments on multiple high-dimensional and infinite-state synthetic benchmarks demonstrate consistent superiority over state-of-the-art baselines, with significant improvements in estimation accuracy and scalability. Our approach establishes a new paradigm for offline IRL/DDC that simultaneously delivers rigorous theoretical guarantees and strong practical performance.
📝 Abstract
We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.