AI Summary
Learning robust reward machines (RMs) from noisy execution traces remains challenging in reinforcement learning. Method: We propose a closed-loop co-learning framework featuring: (1) Bayesian posterior belief modeling to explicitly quantify trajectory uncertainty and noise tolerance; (2) an alternating online update mechanism that jointly optimizes the RM and the policy; (3) the first probabilistic reward shaping based on posterior beliefs, enabling stable extraction of transferable RMs under high noise; and (4) integration of inductive logic programming (ILP), finite-state machine modeling, and online RM relearning. Results: Experiments show that the learned RMs closely approximate ground-truth structures under noise, and agents guided by them perform on par with agents given handcrafted RM baselines, demonstrating gains in robustness, transferability, and practical applicability.
Abstract
This paper presents PROB-IRM, an approach that learns robust reward machines (RMs) for reinforcement learning (RL) agents from noisy execution traces. The key aspect of RM-driven RL is the exploitation of a finite-state machine that decomposes the agent's task into different subtasks. PROB-IRM uses a state-of-the-art inductive logic programming framework robust to noisy examples to learn RMs from noisy traces using Bayesian posterior degrees of belief, thus ensuring robustness against inconsistencies. Pivotal for the results is the interleaving between RM learning and policy learning: a new RM is learned whenever the RL agent generates a trace that is believed not to be accepted by the current RM. To speed up the training of the RL agent, PROB-IRM employs a probabilistic formulation of reward shaping that uses the posterior Bayesian beliefs derived from the traces. Our experimental analysis shows that PROB-IRM can learn (potentially imperfect) RMs from noisy traces and exploit them to train an RL agent to solve its tasks successfully. Despite the complexity of learning the RM from noisy traces, agents trained with PROB-IRM perform comparably to agents provided with handcrafted RMs.
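Two of the abstract's ingredients can be illustrated concretely: an RM is a finite-state machine whose transitions fire on high-level events and emit rewards, and under noisy event labelling the agent can maintain a posterior belief over RM states and shape rewards by expectation. The sketch below is a hypothetical simplification, not the paper's implementation: the `RewardMachine` class, the two-subtask task ("reach `a`, then `b`"), and the 90%-confident labeller are all illustrative assumptions.

```python
from collections import defaultdict

class RewardMachine:
    """A reward machine: a finite-state machine whose transitions are
    triggered by high-level events and emit rewards (illustrative)."""
    def __init__(self, transitions, initial, terminal):
        # transitions: {(state, event): (next_state, reward)};
        # unlisted (state, event) pairs self-loop with zero reward.
        self.transitions = transitions
        self.initial = initial
        self.terminal = terminal

    def step_belief(self, belief, event_probs):
        """Update the posterior belief over RM states given a distribution
        over noisy event labels, and return the expected shaped reward."""
        new_belief = defaultdict(float)
        exp_reward = 0.0
        for u, p_u in belief.items():
            for event, p_e in event_probs.items():
                v, r = self.transitions.get((u, event), (u, 0.0))
                new_belief[v] += p_u * p_e
                exp_reward += p_u * p_e * r
        return dict(new_belief), exp_reward

# Hypothetical two-subtask RM: reach 'a' (state 0 -> 1), then 'b' (1 -> 2).
rm = RewardMachine(
    transitions={(0, 'a'): (1, 0.0), (1, 'b'): (2, 1.0)},
    initial=0, terminal={2},
)
belief = {rm.initial: 1.0}
# Noisy labeller: 90% sure event 'a' occurred, 10% that nothing did.
belief, r = rm.step_belief(belief, {'a': 0.9, None: 0.1})
print(belief, r)  # belief mass is split between states 1 and 0
```

Because the belief never collapses to a single RM state, a trace that the current RM is *believed* not to accept (low posterior mass on an accepting path) can trigger RM relearning, matching the interleaving described above.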