🤖 AI Summary
This work addresses the problem of inferring a reward machine from partially observable optimal policies. We introduce the notion of *prefix tree policies*, which explicitly model the joint dependency between action distributions and sequences of atomic propositions. We establish a theoretical framework for reward machine identifiability, characterize the corresponding equivalence classes of reward machines, and, crucially, provide an exact recovery guarantee under a bounded policy depth. Our method integrates the prefix-tree representation, a SAT encoding, MDP state abstraction, and formal verification, and we prove that, given the policy up to a sufficient finite depth, the true reward machine can be uniquely recovered up to equivalence. Empirical evaluation across multiple benchmark domains demonstrates the algorithm's effectiveness, robustness to partial observability, and scalability.
📝 Abstract
Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy. In this work, the reward is assumed to be expressed as a reward machine whose transitions depend on atomic propositions associated with the states of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy, which associates a distribution over actions with each state of the MDP and each attainable finite sequence of atomic propositions. We then characterize the equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. We prove that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to its equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. Several examples demonstrate the effectiveness of the approach.
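To make the central data structure concrete, here is a minimal illustrative sketch of a prefix tree policy: a table mapping each pair (MDP state, finite sequence of atomic-proposition labels observed so far) to a distribution over actions. All names here are hypothetical and chosen for illustration; the paper's actual definitions and algorithm are not reproduced.

```python
class PrefixTreePolicy:
    """Illustrative sketch (hypothetical names, not the paper's implementation):
    stores an action distribution for each (state, label-prefix) pair."""

    def __init__(self):
        # key: (state, tuple of atomic-proposition labels observed so far)
        # value: dict mapping actions to probabilities
        self._table = {}

    def set_distribution(self, state, prefix, dist):
        # The stored values must form a probability distribution over actions.
        assert abs(sum(dist.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
        self._table[(state, tuple(prefix))] = dict(dist)

    def action_distribution(self, state, prefix):
        # Returns the action distribution for this state after the given
        # finite sequence of atomic propositions, or None if unattainable.
        return self._table.get((state, tuple(prefix)))


# Example: in state "s0", after observing the label sequence ("a", "b"),
# the optimal policy puts all probability mass on action "left".
policy = PrefixTreePolicy()
policy.set_distribution("s0", ("a", "b"), {"left": 1.0})
print(policy.action_distribution("s0", ("a", "b")))  # {'left': 1.0}
```

In the paper's setting, such a table would be populated only up to the derived sufficient depth, after which the SAT encoding uses its entries as constraints on the candidate reward machine.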