🤖 AI Summary
Learning high-reliability policies for autonomous AI systems (e.g., autonomous driving) from limited, noisy human demonstrations remains challenging.
Method: This paper proposes a theoretically grounded active inverse reinforcement learning (IRL) framework. It introduces the PAC-EIG acquisition function—the first to provide Probably Approximately Correct (PAC) guarantees for active IRL under noisy expert demonstrations—and its variant Reward-EIG, for the case where recovering the reward itself is the primary objective. Leveraging Bayesian active learning, the method selects the most informative demonstration scenarios by maximising expected information gain about the apprentice policy's regret, within finite state-action spaces.
Contributions/Results: The authors establish theoretical convergence bounds and uncover failure modes of existing heuristic approaches. Experiments demonstrate substantial improvements in both sample efficiency and policy reliability over baselines.
📝 Abstract
As AI systems become increasingly autonomous, reliably aligning their decision-making to human preferences is essential. Inverse reinforcement learning (IRL) offers a promising approach to infer preferences from demonstrations. These preferences can then be used to produce an apprentice policy that performs well on the demonstrated task. However, in domains like autonomous driving or robotics, where errors can have serious consequences, we need not just good average performance but reliable policies with formal guarantees -- yet obtaining sufficient human demonstrations for reliability guarantees can be costly. Active IRL addresses this challenge by strategically selecting the most informative scenarios for human demonstration. We introduce PAC-EIG, an information-theoretic acquisition function that directly targets probably-approximately-correct (PAC) guarantees for the learned policy -- providing the first such theoretical guarantee for active IRL with noisy expert demonstrations. Our method maximises information gain about the regret of the apprentice policy, efficiently identifying states requiring further demonstration. We also present Reward-EIG as an alternative when learning the reward itself is the primary objective. Focusing on finite state-action spaces, we prove convergence bounds, illustrate failure modes of prior heuristic methods, and demonstrate our method's advantages experimentally.
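The query-selection idea above can be sketched for a finite state-action space. The snippet below is a generic Bayesian expected-information-gain acquisition over a discrete set of reward hypotheses with a Boltzmann-noisy expert; the hypothesis set, noise model, and all names are illustrative assumptions, and the gain here is about the reward hypothesis rather than the paper's PAC-EIG objective of information gain about apprentice-policy regret.

```python
import numpy as np

def boltzmann(q_values, beta=5.0):
    """Noisy-expert action distribution at one state (softmax over Q-values).
    Assumed noise model; the paper only specifies that demonstrations are noisy."""
    z = np.exp(beta * (q_values - q_values.max()))
    return z / z.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def eig_per_state(posterior, Q):
    """Expected information gain from querying the expert at each state.

    posterior : (H,) weights over candidate reward hypotheses.
    Q         : (H, S, A) expert Q-values under each hypothesis.
    Returns   : (S,) mutual information I(hypothesis; expert action | state).
    """
    H, S, A = Q.shape
    # Likelihood P(a | s, h) of each noisy expert action under each hypothesis.
    lik = np.stack([[boltzmann(Q[h, s]) for s in range(S)] for h in range(H)])
    prior_entropy = entropy(posterior)
    gains = np.zeros(S)
    for s in range(S):
        marginal = posterior @ lik[:, s, :]          # P(a | s), shape (A,)
        expected_post_entropy = 0.0
        for a in range(A):
            post = posterior * lik[:, s, a]          # Bayes update on action a
            post /= post.sum()
            expected_post_entropy += marginal[a] * entropy(post)
        gains[s] = prior_entropy - expected_post_entropy
    return gains

# Toy example: 3 reward hypotheses, 4 states, 2 actions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4, 2))
posterior = np.ones(3) / 3
query_state = int(np.argmax(eig_per_state(posterior, Q)))
```

The active loop would query the expert at `query_state`, condition the posterior on the observed action, and repeat; the paper's contribution is replacing this generic hypothesis-uncertainty objective with one that directly targets the PAC regret guarantee.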