AI Summary
Traditional inverse reinforcement learning assumes static rewards, making it ill-suited for modeling agents that dynamically switch goals during a task. This work proposes PRISM, the first multi-intention inverse reinforcement learning framework to incorporate a lightweight recurrent network, which models a discrete intention distribution at each time step based on observation history, thereby circumventing both the Markov assumption and fixed-window state augmentation. We derive a decomposable EM objective that admits a closed-form E-step solution with O(nK) complexity, eliminating the need for variational approximations. Experiments demonstrate that PRISM achieves state-of-the-art leave-one-out log-likelihood across non-Markovian grid worlds, mouse maze navigation, and the BridgeData V2 robotic manipulation benchmark, while recovering temporally coherent and interpretable intention sequences from unlabeled demonstrations.
Abstract
Inverse reinforcement learning (IRL) recovers reward functions from observed behavior, yet traditional methods assume a single stationary reward that cannot capture goal switching within an episode. Recent multi-intention IRL methods address this by segmenting trajectories, but they model intention transitions either as a memoryless Markov chain or through manual state augmentation with a fixed history window. We propose the Probabilistic Recurrent Intention Switching Model (PRISM), which replaces both mechanisms with a lightweight recurrent network that maps observation history to a per-step intention distribution. We prove that the resulting EM objective decomposes exactly into independent per-intention reward subproblems, each solvable in closed form, yielding an $\mathcal{O}(nK)$ E-step with no variational approximation. We evaluate PRISM on a non-Markovian gridworld, a mouse labyrinth, and BridgeData V2 robotic manipulation; the last is the first large-scale robotic application of multi-intention IRL. Across all settings PRISM achieves the highest held-out log-likelihood while recovering nameable, temporally coherent intentions from unlabeled demonstrations, suggesting that discrete goal switching is present in both biological and artificial agents.
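To make the $\mathcal{O}(nK)$ claim concrete, the following is a minimal sketch of how a per-step E-step could look. It is not the authors' implementation. It assumes (our reading, not stated in the abstract) that, conditioned on the observation history, the intention at each step is independent of the other steps, so the posterior factorizes over time and no forward-backward pass is needed. The function names `intention_priors` and `e_step`, the Elman-style recurrent cell, and the weight matrices are hypothetical placeholders chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intention_priors(obs, W_in, W_rec, W_out):
    """Hypothetical lightweight recurrent network (Elman-style, an assumption).

    obs:   (n, D) observations for one trajectory
    W_in:  (H, D), W_rec: (H, H), W_out: (K, H) weight matrices
    Returns an (n, K) array; row t is a distribution over K intentions
    computed from observations 0..t.
    """
    n = obs.shape[0]
    h = np.zeros(W_rec.shape[0])
    priors = np.empty((n, W_out.shape[0]))
    for t in range(n):
        h = np.tanh(W_in @ obs[t] + W_rec @ h)  # summarize history so far
        priors[t] = softmax(W_out @ h)
    return priors

def e_step(priors, action_loglik):
    """Per-step responsibilities under the factorization assumption above.

    priors:        (n, K) intention distribution from the recurrent network
    action_loglik: (n, K) log p(a_t | s_t, intention k) under each policy
    Returns gamma of shape (n, K), with gamma[t, k] proportional to
    priors[t, k] * p(a_t | s_t, k). The cost is one pass over n * K entries.
    """
    log_post = np.log(priors) + action_loglik
    return softmax(log_post, axis=1)

# Toy usage with random weights and random per-intention log-likelihoods.
rng = np.random.default_rng(0)
n, D, H, K = 8, 3, 5, 2
obs = rng.normal(size=(n, D))
priors = intention_priors(
    obs,
    rng.normal(size=(H, D)),
    rng.normal(size=(H, H)),
    rng.normal(size=(K, H)),
)
gamma = e_step(priors, rng.normal(size=(n, K)))
```

Under this assumption each responsibility depends only on its own time step, which is why the cost scales as $nK$ rather than the $nK^2$ of a forward-backward pass over a Markov chain of intentions. In an M-step, `gamma[:, k]` would then weight the demonstrations used to fit the reward for intention `k`.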