🤖 AI Summary
This work addresses apprenticeship learning in settings where expert demonstrations are scarce but additional unlabeled trajectories are available. It proposes MESSI, an algorithm that brings principles from semi-supervised learning into the maximum entropy inverse reinforcement learning (MaxEnt-IRL) framework by adding a pairwise penalty on trajectories, so that unlabeled data can inform the optimization of the reward function. By modeling relationships between pairs of trajectories, the method reduces its reliance on expert data alone. Experiments on highway driving and grid-world tasks indicate that the approach takes advantage of the unlabeled trajectories and improves on the performance of standard MaxEnt-IRL.
📝 Abstract
A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm successfully integrates the maximum entropy principle into IRL and, unlike its predecessors, resolves the ambiguity arising from the fact that a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which, in addition to the expert's trajectories, a number of unsupervised trajectories are available. We introduce MESSI, a novel algorithm that combines MaxEnt-IRL with principles from semi-supervised learning. In particular, MESSI integrates the unsupervised data into the MaxEnt-IRL framework using a pairwise penalty on trajectories. Empirical results on highway driving and grid-world problems indicate that MESSI is able to take advantage of the unsupervised trajectories and improve the performance of MaxEnt-IRL.
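
The idea of combining a MaxEnt-IRL likelihood with a pairwise penalty on trajectories can be sketched roughly as follows. This is an illustrative toy, not the MESSI implementation: it assumes a reward that is linear in trajectory feature vectors, a partition function computed over a small finite candidate set rather than all trajectories of an MDP, a Gaussian similarity between trajectories, and a squared-difference penalty weighted by `lam`. All function and parameter names here are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def log_likelihood(theta, expert_feats, all_feats):
    """Mean MaxEnt log-likelihood of the expert trajectories.

    Assumption: P(traj | theta) is proportional to exp(theta . f_traj), and the
    partition function is taken over the finite candidate set `all_feats`.
    """
    scores = [dot(theta, f) for f in all_feats]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return sum(dot(theta, f) - log_z for f in expert_feats) / len(expert_feats)

def similarity(f, g, bandwidth=1.0):
    """Gaussian similarity between two feature vectors (an assumed choice)."""
    d2 = sum((x - y) ** 2 for x, y in zip(f, g))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))

def pairwise_penalty(theta, feats, bandwidth=1.0):
    """Penalize reward differences between pairs of similar trajectories."""
    n = len(feats)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = [x - y for x, y in zip(feats[i], feats[j])]
            total += similarity(feats[i], feats[j], bandwidth) * dot(theta, diff) ** 2
    return total / n

def objective(theta, expert_feats, unlabeled_feats, lam=0.1):
    """Penalized objective to maximize: expert likelihood minus pairwise penalty."""
    all_feats = expert_feats + unlabeled_feats
    return (log_likelihood(theta, expert_feats, all_feats)
            - lam * pairwise_penalty(theta, all_feats))

# Toy data: one expert trajectory and two unlabeled ones (hypothetical features).
expert = [[1.0, 0.0]]
unlabeled = [[0.9, 0.1], [0.0, 1.0]]
value = objective([1.0, 0.0], expert, unlabeled, lam=0.1)
```

In this sketch the unlabeled trajectories enter only through the penalty and the candidate set: trajectories that look alike are pushed toward similar rewards, which is one way a reward estimate could draw on data that carries no expert label. Setting `lam=0` recovers the plain likelihood term.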