🤖 AI Summary
In high-reliability domains (e.g., healthcare), quantifying the interpretability of reinforcement learning (RL) policies remains challenging due to the lack of objective evaluation criteria and heavy reliance on costly human assessments.
Method: We propose the first fully automated, human-free interpretability evaluation paradigm, built upon a simulatability-based empirical framework that integrates program distillation, imitation learning, and symbolic program generation, complemented by computationally tractable interpretability metrics.
Contributions/Results: (1) The first scalable, human-free quantitative assessment of RL policy interpretability; (2) Empirical evidence that interpretability and task performance are non-negatively correlated—and synergistically improved in certain settings; (3) Refutation of the existence of a universally optimal policy class across tasks; (4) Strong agreement between automated evaluations and user studies, with all evaluation protocols and baseline code publicly released.
📝 Abstract
There exist applications of reinforcement learning, such as medicine, where policies need to be ''interpretable'' by humans. User studies have shown that some policy classes might be more interpretable than others. However, conducting human studies of policy interpretability is costly. Furthermore, there is no clear definition of policy interpretability, i.e., no agreed-upon metrics, so claims depend on the chosen definition. We tackle the problem of empirically evaluating policy interpretability without humans. Despite this lack of a clear definition, researchers agree on the notion of ''simulatability'': policy interpretability should relate to how well humans can predict policy actions given states. To advance research in interpretable reinforcement learning, we contribute a new methodology for evaluating policy interpretability. This methodology relies on proxies for simulatability, which we use to conduct a large-scale empirical evaluation of policy interpretability. We use imitation learning to compute baseline policies by distilling expert neural networks into small programs. We then show that using our methodology to evaluate the baselines' interpretability leads to conclusions similar to those of user studies. We show that increasing interpretability does not necessarily reduce performance and can sometimes increase it. We also show that no single policy class best trades off interpretability and performance across tasks, making it necessary for researchers to have methodologies for comparing policy interpretability.
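To make the distillation-plus-proxy idea concrete, here is a minimal, hypothetical sketch (not the paper's actual pipeline or metrics): a hand-coded function stands in for a trained neural expert, behavioral cloning distills it into a depth-limited decision tree, and two illustrative proxies are computed, fidelity (how often the small program reproduces the expert's actions on held-out states, a stand-in for simulatability) and program size (node count, a stand-in for interpretability). All names and thresholds here are assumptions for illustration.

```python
# Hypothetical sketch: distill an "expert" policy into a small program
# (a depth-limited decision tree) and compute simple proxy metrics.
# The expert function, dataset sizes, and metrics are illustrative
# assumptions, not the paper's actual method.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def expert_policy(state):
    # Stand-in for a trained neural policy on a toy 2-D task:
    # a linear decision rule over the state features.
    return int(state[0] + 0.5 * state[1] > 0.0)

# Behavioral cloning dataset: states labeled with the expert's actions.
states = rng.uniform(-1, 1, size=(2000, 2))
actions = np.array([expert_policy(s) for s in states])

# Distill the expert into a small program (the "baseline policy").
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(states[:1500], actions[:1500])

# Proxy metrics: higher fidelity ~ easier to simulate the expert;
# fewer nodes ~ a smaller, more interpretable program.
fidelity = tree.score(states[1500:], actions[1500:])
size = tree.tree_.node_count
print(f"fidelity={fidelity:.2f}, program size={size} nodes")
```

In this toy setting one would then sweep `max_depth` to trace an interpretability/performance trade-off curve, which is the kind of comparison the methodology enables without human raters.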