🤖 AI Summary
Supervised offline reinforcement learning (RvS) methods—e.g., Decision Transformers—exhibit significant deficiencies in target-return alignment: they struggle to interpolate reliably in data-sparse regions or extrapolate robustly to out-of-distribution returns. To address this, we propose Doctor, a framework that deeply integrates target-conditioned sequence modeling with a dynamic return-alignment mechanism and introduces a novel dual-verification scheme. This design substantially enhances the Transformer’s capacity to modulate policy behavior across arbitrary target returns. Doctor is the first method to achieve precise, stable regulation of actual policy returns, simultaneously ensuring robustness in both interpolation and extrapolation. On benchmarks including EpiCare, Doctor significantly improves target-return alignment accuracy. Furthermore, it demonstrates practical efficacy in clinical treatment strategy optimization—enabling fine-grained balancing between therapeutic benefit and adverse-event risk.
📝 Abstract
Offline reinforcement learning (RL) has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. Most existing methods primarily focus on training policies that maximize cumulative returns from a given dataset. However, many real-world applications require precise control over policy performance levels, rather than simply pursuing the best possible return. Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task, enabling the extraction of diverse policies by conditioning on different desired returns. Yet, existing RvS-based transformers, such as Decision Transformer (DT), struggle to reliably align the actual achieved returns with specified target returns, especially when interpolating within return ranges underrepresented in the data or extrapolating beyond the dataset's return distribution. To address this limitation, we propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL. Doctor achieves superior target alignment both within and beyond the dataset, while enabling accurate and flexible control over policy performance. Notably, on the dynamic treatment regime benchmark, EpiCare, our approach effectively modulates treatment policy aggressiveness, balancing therapeutic returns against adverse event risk.
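To make the RvS conditioning idea concrete, here is a minimal, hypothetical sketch of return-conditioned action selection. A nearest-neighbour lookup stands in for a learned Transformer policy; the toy dataset, the `rvs_policy` function, and all values are illustrative assumptions, and this does not depict Doctor's dual-check mechanism — only the general principle that the desired return is an input that steers behavior.

```python
import numpy as np

# Hypothetical offline dataset: each row is (return_to_go, state, action).
# Low-return trajectories took action 0; high-return trajectories took action 1.
dataset = np.array([
    [10.0, 0.0, 0],
    [12.0, 0.0, 0],
    [48.0, 0.0, 1],
    [50.0, 0.0, 1],
])

def rvs_policy(state, target_return, k=2):
    """Nearest-neighbour stand-in for a learned return-conditioned policy:
    return the majority action among the k dataset rows whose
    (return-to-go, state) pair is closest to the query."""
    query = np.array([target_return, state])
    dists = np.linalg.norm(dataset[:, :2] - query, axis=1)
    nearest = dataset[np.argsort(dists)[:k], 2].astype(int)
    return np.bincount(nearest).argmax()

# Conditioning on different target returns extracts different policies
# from the same dataset.
print(rvs_policy(state=0.0, target_return=11.0))  # -> 0 (imitates low-return data)
print(rvs_policy(state=0.0, target_return=49.0))  # -> 1 (imitates high-return data)
```

The misalignment the paper targets appears exactly where this lookup has no support: asking for a return of 30.0 (interpolation between clusters) or 80.0 (extrapolation beyond the data) gives no guarantee the achieved return will match the request.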