Explainable Reinforcement Learning Agents Using World Models

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

career value

233K/year

🤖 AI Summary

To address the lack of interpretability in model-based deep reinforcement learning (MBRL) agents’ decision-making for non-expert users, this paper proposes a causal explanation framework based on synergistic forward and inverse world models. The core innovation is the first formalization of an inverse world model, which enables backward inference from target actions to requisite environmental states; integrated with a forward world model, it generates counterfactual trajectories that reveal the policy’s dependence on environmental states. Crucially, the method operates without access to the agent’s internal policy or reward function, yielding visualizable, causally attributable action explanations. User studies demonstrate that, compared to baseline approaches, our method improves users’ understanding of the policy by 32.7%, increases behavioral prediction accuracy by 28.4%, and significantly enhances their ability to guide the agent toward target actions via environmental interventions.

Technology Category

Application Category

📝 Abstract

Explainable AI (XAI) systems have been proposed to help people understand how AI systems produce outputs and behaviors. Explainable Reinforcement Learning (XRL) has an added complexity due to the temporal nature of sequential decision-making. Further, non-AI experts do not necessarily have the ability to alter an agent or its policy. We introduce a technique for using World Models to generate explanations for Model-Based Deep RL agents. World Models predict how the world will change when actions are performed, allowing for the generation of counterfactual trajectories. However, identifying what a user wanted the agent to do is not enough to understand why the agent did something else. We augment Model-Based RL agents with a Reverse World Model, which predicts what the state of the world should have been for the agent to prefer a given counterfactual action. We show that explanations that show users what the world should have been like significantly increase their understanding of the agent policy. We hypothesize that our explanations can help users learn how to control the agents execution through by manipulating the environment.

Problem

Research questions and friction points this paper is trying to address.

Explain RL agent decisions using World Models

Generate counterfactual trajectories for user understanding

Enhance policy comprehension with Reverse World Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using World Models to explain Model-Based Deep RL

Augmenting RL agents with a Reverse World Model

Generating counterfactual trajectories for user understanding

🔎 Similar Papers

No similar papers found.