🤖 AI Summary
This work addresses the problem of inferring a reward machine from partially observable optimal policies. We introduce the notion of *prefix tree policies*, which explicitly model the joint dependency between action distributions and sequences of atomic propositions. We establish a theoretical framework for reward machine identifiability, characterize the corresponding equivalence classes of reward machines, and, crucially, provide an exact recovery guarantee under a bounded policy depth. Our method integrates the prefix-tree representation, a SAT encoding, MDP state abstraction, and formal verification, and we prove that, given the policy up to a sufficient finite depth, the true reward machine can be uniquely recovered up to equivalence. Empirical evaluation across multiple benchmark domains demonstrates the algorithm's effectiveness, robustness to partial observability, and scalability.
📝 Abstract
Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy. In this work, the reward is assumed to be expressed as a reward machine whose transitions depend on atomic propositions associated with the states of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy, which associates a distribution over actions with each state of the MDP and each attainable finite sequence of atomic propositions. We then characterize the equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. We prove that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to its equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. Several examples demonstrate the effectiveness of the approach.
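To make the central data structure concrete, here is a minimal illustrative sketch of a prefix tree policy: a table mapping each pair (MDP state, finite sequence of atomic-proposition labels observed so far) to a distribution over actions. All names here are hypothetical and chosen for illustration; the paper's actual definitions and algorithm are not reproduced.

```python
class PrefixTreePolicy:
    """Illustrative sketch (hypothetical names, not the paper's implementation):
    stores an action distribution for each (state, label-prefix) pair."""

    def __init__(self):
        # key: (state, tuple of atomic-proposition labels observed so far)
        # value: dict mapping actions to probabilities
        self._table = {}

    def set_distribution(self, state, prefix, dist):
        # The stored values must form a probability distribution over actions.
        assert abs(sum(dist.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
        self._table[(state, tuple(prefix))] = dict(dist)

    def action_distribution(self, state, prefix):
        # Returns the action distribution for this state after the given
        # finite sequence of atomic propositions, or None if unattainable.
        return self._table.get((state, tuple(prefix)))


# Example: in state "s0", after observing the label sequence ("a", "b"),
# the optimal policy puts all probability mass on action "left".
policy = PrefixTreePolicy()
policy.set_distribution("s0", ("a", "b"), {"left": 1.0})
print(policy.action_distribution("s0", ("a", "b")))  # {'left': 1.0}
```

In the paper's setting, such a table would be populated only up to the derived sufficient depth, after which the SAT encoding uses its entries as constraints on the candidate reward machine.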