POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dynamic treatment regime (DTR) methods suffer from policy fragility due to insufficient data coverage and strong positivity assumptions. Method: We propose a pessimistic model-based offline reinforcement learning algorithm that models state-transition dynamics, quantifies estimation uncertainty, and incorporates a pessimistic penalty term into the reward function to directly optimize an upper bound on policy suboptimality. Contribution/Results: This is the first approach within the model-based DTR framework to simultaneously provide finite-sample statistical guarantees and computational efficiency—bypassing computationally intensive optimization routines required by prior methods. Evaluated on synthetic benchmarks and real-world MIMIC-III clinical data, our method significantly outperforms state-of-the-art baselines, yielding history-dependent, robust, and near-optimal personalized treatment policies.

📝 Abstract
Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline reinforcement learning approaches typically focus on average training performance, lack statistical guarantees, and require solving complex optimization problems. To address these challenges, we propose POLAR, a novel pessimistic model-based policy learning algorithm for offline DTR optimization. POLAR estimates the transition dynamics from offline data and quantifies uncertainty for each history-action pair. A pessimistic penalty is then incorporated into the reward function to discourage actions with high uncertainty. Unlike many existing methods that focus on average training performance, POLAR directly targets the suboptimality of the final learned policy and offers theoretical guarantees, without relying on computationally intensive minimax or constrained optimization procedures. To the best of our knowledge, POLAR is the first model-based DTR method to provide both statistical and computational guarantees, including finite-sample bounds on policy suboptimality. Empirical results on both synthetic data and the MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods and yields near-optimal, history-aware treatment strategies.
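The central mechanism described in the abstract, subtracting an uncertainty penalty from the reward so that the learned policy avoids poorly covered history-action pairs, can be sketched in a few lines. This is a minimal illustration under assumptions not taken from the paper: a linear dynamics model fit per action, bootstrap-ensemble disagreement as the uncertainty estimate `u(s, a)`, and a penalty weight `lam` are all stand-ins for POLAR's actual construction.

```python
import numpy as np

def pessimistic_rewards(states, actions, next_states, rewards,
                        n_models=10, lam=1.0, seed=0):
    """Sketch of a pessimistic reward adjustment: fit a bootstrap
    ensemble of per-action linear dynamics models, use ensemble
    disagreement as an uncertainty estimate u(s, a), and return
    the penalized reward r(s, a) - lam * u(s, a)."""
    rng = np.random.default_rng(seed)
    n, d = states.shape
    X = np.hstack([states, np.ones((n, 1))])  # add an intercept column
    uncertainty = np.zeros(n)
    for a in np.unique(actions):
        idx = np.where(actions == a)[0]
        preds = []
        for _ in range(n_models):
            # Refit the dynamics model on a bootstrap resample.
            boot = rng.choice(idx, size=len(idx), replace=True)
            W, *_ = np.linalg.lstsq(X[boot], next_states[boot], rcond=None)
            preds.append(X[idx] @ W)
        preds = np.stack(preds)  # (n_models, |idx|, state_dim)
        # Disagreement across the ensemble, averaged over state dims.
        uncertainty[idx] = preds.std(axis=0).mean(axis=1)
    return rewards - lam * uncertainty

# Toy offline dataset: 3-dim states, binary actions, noisy linear dynamics.
rng = np.random.default_rng(1)
states = rng.normal(size=(200, 3))
actions = rng.integers(0, 2, size=200)
next_states = (0.5 * states + 0.1 * actions[:, None]
               + 0.05 * rng.normal(size=(200, 3)))
rewards = rng.normal(size=200)
r_pess = pessimistic_rewards(states, actions, next_states, rewards)
```

Any downstream planner or policy-learning step would then optimize against `r_pess` instead of the raw rewards, which is what discourages actions whose dynamics the offline data pins down poorly.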
Problem

Research questions and friction points this paper is trying to address.

Optimizing sequential decision-making in dynamic treatment regimes
Addressing lack of robustness under partial data coverage
Providing statistical guarantees without complex optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pessimistic model-based policy learning algorithm
Uncertainty quantification for history-action pairs
Theoretical guarantees without complex optimization
Ruijia Zhang
Department of Applied Mathematics and Statistics, Johns Hopkins University
Optimization · Reinforcement Learning · Applied Probability
Zhengling Qi
School of Business, The George Washington University
Yue Wu
Department of Applied Mathematics and Statistics, Johns Hopkins University
Xiangyu Zhang
Department of Applied Mathematics and Statistics, Johns Hopkins University
Yanxun Xu
Johns Hopkins University
Bayesian · Clinical Trial Design · Electronic Health Record Data · Network Data