🤖 AI Summary
The lack of interpretability in reinforcement learning (RL) policies undermines their safety, reliability, and fairness. Method: We propose a functional modularization framework for policy networks that jointly optimizes structural sparsity and functional decoupling via non-local connection penalization; automatically identifies functional roles using graph community detection (Louvain/Leiden) on learned weight graphs; and enforces module integrity and cognitive traceability through targeted weight intervention and sparse regularization. Contribution/Results: Evaluated on stochastic MiniGrid environments, our method successfully isolates distinct modules for X- and Y-axis motion evaluation. Intervention-based validation achieves >92% accuracy in functional attribution, significantly enhancing the interpretability, verifiability, and scalability of policy behavior attribution.
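The non-local connection penalty at the heart of the method can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the 1-D neuron coordinates, the `alpha` strength parameter, and the function name are all assumptions. The idea is simply that each neuron gets a spatial position, and a weight's regularization cost grows with the distance it spans, so the optimizer drives long-range connections toward zero and local, modular wiring emerges.

```python
import numpy as np

def nonlocal_penalty(weights, in_coords, out_coords, alpha=1.0):
    """Distance-weighted L1 penalty on a layer's weights (illustrative).

    weights    : (n_out, n_in) weight matrix of one layer
    in_coords  : (n_in,)  assumed 1-D spatial positions of input neurons
    out_coords : (n_out,) assumed 1-D spatial positions of output neurons
    alpha      : assumed penalty-strength hyperparameter
    """
    # Pairwise distance each connection spans, shape (n_out, n_in).
    dist = np.abs(out_coords[:, None] - in_coords[None, :])
    # Weights are taxed in proportion to both magnitude and span,
    # so purely local connections (distance 0) incur no cost.
    return alpha * np.sum(dist * np.abs(weights))

coords = np.arange(3.0)
W_local = np.eye(3)            # each neuron connects to its neighbour at distance 0
W_far = np.fliplr(np.eye(3))   # same magnitudes, but long-range connections

print(nonlocal_penalty(W_local, coords, coords))  # → 0.0
print(nonlocal_penalty(W_far, coords, coords))    # → 4.0
```

Adding this term to the RL loss alongside a plain sparsity regularizer gives the joint "structural sparsity plus functional decoupling" objective the summary describes.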
📝 Abstract
Interpretability in reinforcement learning is crucial for ensuring that AI systems align with human values and meet related requirements such as safety, robustness, and fairness. Building on recent approaches that encourage sparsity and locality in neural networks, we demonstrate how penalization of non-local weights leads to the emergence of functionally independent modules in the policy network of a reinforcement learning agent. As a concrete illustration, we show the emergence of two parallel modules that assess movement along the X and Y axes in a stochastic MiniGrid environment. Through a novel application of community detection algorithms, we show how these modules can be automatically identified and their functional roles verified by direct intervention on the network weights prior to inference. This establishes a scalable framework for reinforcement learning interpretability through functional modularity, addressing the trade-off between completeness and cognitive tractability of reinforcement learning explanations.
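The module-identification step can be sketched with off-the-shelf tools: treat the trained layer's absolute weights as edge weights of an undirected graph and run Louvain community detection on it. This is a minimal stand-in using `networkx`'s Louvain implementation, assuming the paper's weight-graph construction; the toy edge list below (two tightly connected neuron groups joined by one weak residual connection, standing in for the X- and Y-axis modules) is invented for illustration.

```python
import networkx as nx

def weight_modules(weight_edges, seed=0):
    """Recover candidate functional modules from (u, v, |w|) weight triples."""
    G = nx.Graph()
    G.add_weighted_edges_from(weight_edges)
    # Louvain maximizes modularity of the weighted graph; each returned
    # set of nodes is one candidate functional module.
    return nx.community.louvain_communities(G, weight="weight", seed=seed)

# Toy weight graph: two dense neuron clusters bridged by one weak weight.
edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),   # module A (e.g. X-axis)
    (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),   # module B (e.g. Y-axis)
    (2, 3, 0.05),                            # weak residual cross-connection
]
modules = weight_modules(edges)
print([sorted(m) for m in modules])  # two modules: [0, 1, 2] and [3, 4, 5]
```

Intervention-based validation then follows directly: zero the weights inside one recovered module, run inference, and check that only the corresponding behaviour (e.g. X-axis motion evaluation) degrades.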