🤖 AI Summary
Offline reinforcement learning (RL) faces two critical challenges in safety-critical medical settings: (1) policy uninterpretability undermines clinical trust, and (2) offline policy evaluation—particularly importance sampling—is highly sensitive to behavior policy mismatch. To address these, we propose an interpretable tree-structured behavioral cloning framework that automatically clusters patient states to generate intuitive, traceable decision rules. We incorporate a high-frequency action prior to constrain the policy space, enhancing overlap with the behavior policy and improving offline evaluation stability. Furthermore, we jointly estimate the behavior policy via the tree model, integrating behavioral cloning with importance sampling for both robust policy extraction and reliable evaluation. Evaluated on real-world clinical datasets for rheumatoid arthritis and sepsis, our method achieves significant improvements over current clinical practice: +12.3% in cumulative reward, full decision-path interpretability via visualization, and 47% reduction in evaluation variance—demonstrating superior performance, transparency, and assessment robustness.
📝 Abstract
Offline reinforcement learning (RL) holds great promise for deriving optimal policies from observational data, but challenges related to interpretability and evaluation limit its practical use in safety-critical domains. Interpretability is hindered by the black-box nature of unconstrained RL policies, while evaluation -- typically performed off-policy -- is sensitive to large deviations from the data-collecting behavior policy, especially when using methods based on importance sampling. To address these challenges, we propose a simple yet practical alternative: deriving treatment policies from the most frequently chosen actions in each patient state, as estimated by an interpretable model of the behavior policy. By using a tree-based model, which is specifically designed to exploit patterns in the data, we obtain a natural grouping of states with respect to treatment. The tree structure ensures interpretability by design, while varying the number of actions considered controls the degree of overlap with the behavior policy, enabling reliable off-policy evaluation. This pragmatic approach to policy development standardizes frequent treatment patterns, capturing the collective clinical judgment embedded in the data. Using real-world examples in rheumatoid arthritis and sepsis care, we demonstrate that policies derived under this framework can outperform current practice, offering interpretable alternatives to those obtained via offline RL.