🤖 AI Summary
This study investigates the impact of temporal discretization—specifically, timestep granularity—on the performance of offline reinforcement learning (RL) policies for sepsis treatment. Motivated by evidence that conventional 4-hour timesteps may distort patient dynamics and yield suboptimal policies, we develop a unified offline RL framework to systematically evaluate 1-, 2-, 4-, and 8-hour timesteps across state representation, behavior cloning, policy training, and offline evaluation. To enable fair cross-granularity comparison, we propose an action re-mapping technique and a timestep-aware model selection mechanism. Experimental results under a static behavior policy show that 1- and 2-hour timesteps significantly improve both policy performance and stability over the standard 4-hour setting. These findings suggest that finer temporal resolution can enhance the reliability of clinical decision support and provide design guidance for temporal modeling in healthcare RL.
📝 Abstract
Existing studies on reinforcement learning (RL) for sepsis management have mostly followed an established problem setup, in which patient data are aggregated into 4-hour time steps. Although concerns have been raised that this coarse time-step size might distort patient dynamics and lead to suboptimal treatment policies, the extent to which this is a problem in practice remains unexplored. In this work, we conducted empirical experiments for a controlled comparison of four time-step sizes ($\Delta t = 1, 2, 4, 8$ h) in this domain, following an identical offline RL pipeline. To enable a fair comparison across time-step sizes, we designed action re-mapping methods that allow policies to be evaluated on datasets with different time-step sizes, and conducted cross-$\Delta t$ model selection under two policy learning setups. Our goal was to quantify how time-step size influences state representation learning, behavior cloning, policy training, and off-policy evaluation. Our results show that performance trends across $\Delta t$ vary as learning setups change, while policies learned at finer time-step sizes ($\Delta t = 1$ h and $2$ h) using a static behavior policy achieve the overall best performance and stability. Our work highlights time-step size as a core design choice in offline RL for healthcare and provides evidence supporting alternatives beyond the conventional 4-hour setup.
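The abstract does not spell out how the action re-mapping works, but the basic requirement — translating actions between time-step sizes so a policy trained at one $\Delta t$ can be evaluated on a dataset with another — can be illustrated with a minimal sketch. The function names and the choice of aggregation (repeating a coarse action across finer steps, and summing fine-grained doses into a coarse total) are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def remap_coarse_to_fine(actions_coarse, ratio):
    """Repeat each coarse-timestep action across the finer steps it spans,
    e.g. one 4-hour action becomes four identical 1-hour actions (ratio=4).
    Illustrative assumption: the coarse action is held constant over the interval."""
    return np.repeat(np.asarray(actions_coarse), ratio)

def remap_fine_to_coarse(doses_fine, ratio):
    """Aggregate fine-timestep dose amounts into one coarse-timestep action,
    e.g. summing four 1-hour fluid doses into a single 4-hour total (ratio=4).
    Illustrative assumption: actions are additive quantities such as doses."""
    doses = np.asarray(doses_fine, dtype=float)
    n_coarse = len(doses) // ratio  # drop any trailing partial interval
    return doses[: n_coarse * ratio].reshape(n_coarse, ratio).sum(axis=1)
```

For example, re-mapping the 1-hour dose sequence `[1, 2, 3, 4, 5, 6, 7, 8]` to 4-hour steps yields `[10, 26]`, while a 4-hour action sequence `[1, 2]` expands to `[1, 1, 1, 1, 2, 2, 2, 2]` at 1-hour resolution. In a real pipeline, continuous totals would typically be re-binned into the discrete action space used at the target $\Delta t$.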