🤖 AI Summary
This study addresses three recurring challenges in medical reinforcement learning (sparse rewards, unreliable off-policy evaluation, and the deployment-simulation gap) by formulating chronic disease management as a constrained Markov decision process whose objective is to minimize time-to-control (TTC). Execution intensity (ε) and clinician capability (κ) enter as structural components of a two-loop architecture that couples clinical preference learning with offline reinforcement learning. A tiered reward calibrated to the CMS ACCESS Model aligns the objective with clinical goals. In simulated environments for hypertension and type 2 diabetes (T2D), the capability-weighted approach improves T2D TTC by 15 percentage points over both uniform-weighted offline RL and the behavior policy, and ε-aware policies generalize across deployment regimes while ε-naive policies do not.
📝 Abstract
Reinforcement learning (RL) in healthcare has had mixed results, with reward sparsity, unreliable off-policy evaluation, and the deployment-simulation gap as recurring failure modes. We argue that chronic disease management is structurally a more tractable RL setting than the acute-care problems the field has primarily studied, but only if the problem is formalized to exploit chronic care's properties. We propose such a formalization. The agent's objective is to compress time-to-control (TTC) under a tiered reward calibrated to the CMS ACCESS Model. Two quantities from our companion preference-learning paper [Singh et al. 2026] enter as load-bearing structural elements: the execution intensity ε bounds action availability under a constrained Markov Decision Process, and the clinician capability κ weights offline-data transitions during RL training. Together they couple preference learning and RL into a two-loop architecture. We present simulation results on synthetic state machines for hypertension and type 2 diabetes (T2D). Capability-weighted offline RL outperforms uniform-weighted offline RL and the behavior policy by 15 percentage points on T2D TTC; the uniform-weighted formulation (the standard in existing healthcare RL) underperforms even the heterogeneous behavior policy. ε-aware policies generalize across deployment regimes while ε-naive policies do not.
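
The two structural roles described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how ε constrains actions or how κ enters the training loss, so the threshold rule, the normalized weighting, and the names `available_actions` and `capability_weighted_loss` are all assumptions made for illustration.

```python
from typing import List, Sequence


def available_actions(required_intensity: Sequence[float], epsilon: float) -> List[int]:
    """Return indices of actions the deployment can execute.

    Assumption: each action has a scalar intensity requirement, and an
    action is available only if that requirement does not exceed epsilon.
    """
    return [i for i, need in enumerate(required_intensity) if need <= epsilon]


def capability_weighted_loss(td_errors: Sequence[float], kappas: Sequence[float]) -> float:
    """Weighted mean of squared TD errors over a batch of offline transitions.

    Assumption: kappas[i] is the capability score of the clinician who
    generated transition i, used directly as a non-negative sample weight.
    Setting every kappa to the same value recovers uniform weighting.
    """
    if len(td_errors) != len(kappas):
        raise ValueError("td_errors and kappas must have the same length")
    total_weight = sum(kappas)
    if total_weight <= 0:
        raise ValueError("kappas must sum to a positive value")
    weighted = sum(k * e * e for k, e in zip(kappas, td_errors))
    return weighted / total_weight
```

Under these assumptions, a policy is ε-aware if it selects only from `available_actions(...)` at the ε of the deployment it runs in, and transitions from higher-κ clinicians contribute more to the offline training signal than they would under uniform weighting.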