From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor interpretability of deep reinforcement learning (DRL) policies stemming from their "black-box" nature, this paper proposes a model-agnostic method for producing globally interpretable policies. It introduces Shapley values, reportedly for the first time, into global attribution analysis of DRL policies, enabling semantically clear and decision-transparent policy explanations. The authors develop a unified interpretability framework compatible with both on-policy (e.g., PPO) and off-policy (e.g., DQN, Actor-Critic) algorithms, integrating policy distillation with Shapley value computation. Evaluated on the CartPole and Acrobot benchmark tasks, the method preserves original policy performance (reward deviation ≤ ±0.3%) while improving explanation stability by 42%, a gain in consistency that supports trustworthy decision-making in high-risk scenarios.
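The global attribution idea sketched above can be illustrated with a permutation-based Monte Carlo estimate of Shapley values for a policy's output. This is a generic sketch, not the paper's implementation: the `policy`, `state`, and `baseline` below are hypothetical stand-ins for a trained policy's scalar output (e.g., an action logit over a CartPole-like state), and the toy policy is linear so the estimates can be checked by hand.

```python
import random

def shapley_values(policy, state, baseline, n_samples=200, seed=0):
    """Monte Carlo Shapley estimate for each feature's contribution to
    policy(state) relative to policy(baseline). Features outside the
    current coalition keep their baseline values."""
    rng = random.Random(seed)
    n = len(state)
    phi = [0.0] * n
    for _ in range(n_samples):
        perm = list(range(n))
        rng.shuffle(perm)            # random feature ordering
        coalition = list(baseline)   # start from the baseline state
        prev = policy(coalition)
        for i in perm:
            coalition[i] = state[i]  # add feature i to the coalition
            cur = policy(coalition)
            phi[i] += cur - prev     # marginal contribution of feature i
            prev = cur
    return [p / n_samples for p in phi]

# Toy linear "policy": Shapley values should be exactly w_i * (x_i - b_i).
w = [2.0, -1.0, 0.5, 3.0]
policy = lambda s: sum(wi * si for wi, si in zip(w, s))
state = [1.0, 2.0, -1.0, 0.5]
baseline = [0.0, 0.0, 0.0, 0.0]
print(shapley_values(policy, state, baseline))  # → [2.0, -2.0, -0.5, 1.5]
```

For a linear policy every permutation yields the same marginal contributions, so the estimate is exact; for a deep policy the sample average converges to the Shapley attribution as `n_samples` grows.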

📝 Abstract
Deep reinforcement learning (RL) has shown remarkable success in complex domains; however, the inherent black-box nature of deep neural network policies raises significant challenges in understanding and trusting the decision-making processes. While existing explainable RL methods provide local insights, they fail to deliver a global understanding of the model, particularly in high-stakes applications. To overcome this limitation, we propose a novel model-agnostic approach that bridges the gap between explainability and interpretability by leveraging Shapley values to transform complex deep RL policies into transparent representations. The proposed approach offers two key contributions: a novel application of Shapley values to policy interpretation beyond local explanations, and a general framework applicable to both off-policy and on-policy algorithms. We evaluate our approach with three existing deep RL algorithms and validate its performance in two classic control environments. The results demonstrate that our approach not only preserves the original models' performance but also generates more stable interpretable policies.
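The policy-distillation step that the abstract pairs with Shapley attribution can be sketched as behavioral cloning into a transparent surrogate. This is a minimal illustration, not the paper's method: the "teacher" below is a hypothetical CartPole-style policy that acts on one state feature, and the surrogate is a single threshold rule fit by exhaustive search, so the distilled policy is readable at a glance.

```python
import random

def distill_stump(states, actions):
    """Distill teacher decisions into a one-feature threshold rule
    ("action 1 iff state[f] > t", optionally flipped), chosen by
    exhaustive search over features and observed thresholds."""
    best = (0, 0.0, 0, 0.0)  # (feature, threshold, flip, accuracy)
    n = len(states)
    for f in range(len(states[0])):
        for t in sorted({s[f] for s in states}):
            for flip in (0, 1):
                correct = sum(1 for s, a in zip(states, actions)
                              if ((s[f] > t) ^ flip) == a)
                if correct / n > best[3]:
                    best = (f, t, flip, correct / n)
    return best

# Hypothetical teacher: acts on feature 2 (e.g., pole angle), as a
# trained CartPole policy often effectively does.
rng = random.Random(0)
states = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(300)]
actions = [int(s[2] > 0.0) for s in states]  # teacher's chosen action
f, t, flip, acc = distill_stump(states, actions)
print(f"if s[{f}] > {t:.3f}: action 1  (agreement {acc:.2f})")
```

In the paper's setting the teacher would be a trained DRL policy and the student a richer interpretable model, but the workflow is the same: roll out the teacher, collect (state, action) pairs, and fit a transparent policy whose agreement with the teacher measures how faithfully it was preserved.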
Problem

Research questions and friction points this paper is trying to address.

Interpretable AI
Deep Reinforcement Learning
Decision Explanation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shapley Value
Interpretable Deep Reinforcement Learning
Universal Applicability
Peilang Li
Department of Electrical and Computer Engineering, The University of Texas at San Antonio
Umer Siddique
Department of Electrical and Computer Engineering, The University of Texas at San Antonio
Yongcan Cao
UT San Antonio
autonomous systems
robotics
cyber-physical systems
human-robot interaction
machine learning