AI Summary
This work studies inverse reinforcement learning (IRL) in finite-horizon Markov decision processes with linear reward classes, analyzing entropy-regularized min-max IRL (Min-Max-IRL) without exploration assumptions and under model misspecification. The paper shows that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level and, under deterministic dynamics, at the empirical level. Exploiting pseudo-self-concordance of the Min-Max-IRL loss, it proves that, in general Borel state and action spaces and given $n$ expert trajectories, both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $O(n^{-1})$, improving on the classical $O(n^{-1/2})$ rate. The study also extends reward-identifiability results to general Borel spaces and derives new results on the derivatives of the soft-optimal value function with respect to the reward parameters.
Abstract
We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.
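As a rough illustration of the two estimators compared above, one common way to write them for a linear reward class is sketched below. The notation is assumed for illustration and may differ from the paper's: $\phi$ is a feature map, $\theta$ the reward parameter, $H$ the horizon, $(s^i_h, a^i_h)$ the $i$-th expert trajectory, $\pi^\theta$ the soft-optimal policy for $r_\theta$, and the entropy-regularization weight is taken to be 1.

$$
\begin{aligned}
r_\theta(s,a) &= \theta^\top \phi(s,a), \\
\hat\theta_{\mathrm{MLE}} &\in \arg\max_{\theta}\; \frac{1}{n}\sum_{i=1}^{n}\sum_{h=1}^{H} \log \pi^{\theta}_h\big(a^i_h \mid s^i_h\big), \\
\hat\theta_{\mathrm{MM}} &\in \arg\min_{\theta}\; \Big\{ \max_{\pi}\, \mathbb{E}_{\pi}\Big[\sum_{h=1}^{H} \big(r_\theta(s_h,a_h) - \log \pi_h(a_h \mid s_h)\big)\Big] - \frac{1}{n}\sum_{i=1}^{n}\sum_{h=1}^{H} r_\theta\big(s^i_h,a^i_h\big) \Big\}.
\end{aligned}
$$

In this sketch the inner maximum is attained by the soft-optimal policy $\pi^\theta$, so the Min-Max-IRL loss is the soft-optimal value under $r_\theta$ minus the empirical expert return. The population versions replace the empirical averages with expectations over expert trajectories.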