Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Classical dual-ascent inverse reinforcement learning (IRL) guarantees monotonic improvement but must fully solve a reinforcement learning (RL) problem at every iteration, while adversarial approaches avoid that cost at the expense of stability and monotonicity. This work proposes a trust-region-based explicit dual optimization framework that updates the reward function and the policy using only a local search around the current policy, thereby avoiding complete RL solves. The key theoretical insight is that a policy that is trust-region-optimal for a given reward update can be globally optimal for a smaller update in the same direction, so the dual objective can be optimized explicitly with local policy updates. The resulting algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), offers monotonic improvement and stable training, and learns a reward function that can be globally optimized to match expert demonstrations. In experiments, TRIRL outperforms state-of-the-art imitation learning methods by a factor of 2.4× in aggregate inter-quartile mean across multiple challenging tasks, and the learned rewards generalize to shifts in system dynamics.

📝 Abstract

Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL: one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.
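
To make the classical dual-ascent loop described in the abstract concrete, here is a minimal sketch of it on a toy one-step problem. It is not TRIRL and not the authors' code. It assumes a linear reward r(a) = theta · phi(a), for which the entropy-regularized optimal policy is a softmax, so the "full RL solve" at each iteration collapses to a closed form. The function names, the feature table, and the expert distribution are all hypothetical and chosen only for illustration.

```python
import math

def softmax_policy(theta, features):
    """Entropy-regularized optimal policy for r(a) = theta . phi(a) in a one-step problem."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def feature_expectation(policy, features):
    """Expected feature vector under a distribution over actions."""
    dim = len(features[0])
    return [sum(p * phi[k] for p, phi in zip(policy, features)) for k in range(dim)]

def dual_ascent_irl(features, expert_policy, lr=0.5, iters=2000):
    """Classical loop: solve for the optimal policy, then step the reward parameters."""
    expert_fe = feature_expectation(expert_policy, features)
    theta = [0.0] * len(features[0])
    for _ in range(iters):
        # Stand-in for the full RL solve (closed form in this toy setting).
        policy = softmax_policy(theta, features)
        policy_fe = feature_expectation(policy, features)
        # Under this sketch's sign convention, the gradient with respect to the
        # reward parameters is expert minus policy feature expectations.
        theta = [t + lr * (e - p) for t, e, p in zip(theta, expert_fe, policy_fe)]
    return theta, softmax_policy(theta, features)

# Hypothetical toy data: three actions, two reward features.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
expert_policy = [0.6, 0.3, 0.1]
theta, learned_policy = dual_ascent_irl(features, expert_policy)
```

In a sequential task the softmax line would be replaced by a complete RL solve at every iteration, which is the cost the abstract attributes to classical dual-ascent IRL and which the proposed method aims to avoid by searching only locally around the current policy.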

Problem

Research questions and friction points this paper is trying to address.

Inverse Reinforcement Learning

Monotonic Improvement

Trust Region

Reward Function Learning

Imitation Learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust Region

Inverse Reinforcement Learning

Dual Ascent