🤖 AI Summary
This work addresses the limitations of human-selected interpretable concepts in reinforcement learning—namely their reliance on domain expertise, high cost, and lack of performance guarantees. The authors formulate concept selection as a state abstraction problem and introduce a “decision relevance” criterion: retaining only those concepts that distinguish states requiring different actions, thereby ensuring that states sharing the same concept representation also share an optimal action. Building on this principle, they propose DRS, the first algorithm for automatic concept selection in sequential decision-making, which draws on state abstraction theory to define a decision relevance metric and jointly optimizes candidate concept filtering with policy learning. Theoretical analysis provides error bounds linking the selected concepts to policy performance, and experiments demonstrate that DRS automatically recovers or even surpasses handcrafted concept sets across multiple RL benchmarks and a real-world clinical environment, significantly improving the efficacy of concept-based interventions at test time.
📝 Abstract
Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions. This selection demands domain expertise, is time-consuming and costly, scales poorly with the number of candidates, and provides no performance guarantees. To overcome this limitation, we propose the first algorithms for principled automatic concept selection in sequential decision-making. Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions. Agents should therefore rely on decision-relevant concepts: states with the same concept representation then share the same optimal action, which preserves the optimal decision structure of the original state space. This perspective leads to the Decision-Relevant Selection (DRS) algorithm, which selects a subset of concepts from a candidate set, along with performance bounds relating the selected concepts to the performance of the resulting policy. Empirically, DRS automatically recovers manually curated concept sets while matching or exceeding their performance, and improves the effectiveness of test-time concept interventions across reinforcement learning benchmarks and real-world healthcare environments.
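The decision-relevance criterion above can be made concrete in a toy tabular setting. The sketch below is illustrative only and is not the DRS algorithm itself: it assumes oracle access to each state's optimal action (DRS instead optimizes selection jointly with policy learning), and the function names `is_decision_relevant` and `select_concepts` are hypothetical. A concept subset is accepted if no two states that collide under the projection onto that subset require different optimal actions.

```python
def is_decision_relevant(states, optimal_actions, concept_idx):
    """Check whether projecting states onto the given concept indices
    preserves the optimal decision structure: no two states that become
    indistinguishable may require different optimal actions."""
    seen = {}
    for state, action in zip(states, optimal_actions):
        key = tuple(state[i] for i in concept_idx)
        if key in seen and seen[key] != action:
            return False  # confused states need different actions
        seen[key] = action
    return True

def select_concepts(states, optimal_actions):
    """Greedy backward elimination (illustrative, not DRS): drop any
    concept whose removal still separates all states that require
    different actions."""
    kept = list(range(len(states[0])))
    for i in list(kept):
        candidate = [j for j in kept if j != i]
        if is_decision_relevant(states, optimal_actions, candidate):
            kept = candidate
    return kept

# Toy candidate set of 3 concepts; only the first determines the action.
states = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
optimal_actions = ["left", "left", "right", "right"]
print(select_concepts(states, optimal_actions))  # -> [0]
```

Here concept 0 alone separates "left" states from "right" states, so concepts 1 and 2 are pruned as decision-irrelevant; dropping concept 0 instead would merge states with conflicting optimal actions.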