🤖 AI Summary
This work addresses a central challenge in clinical offline reinforcement learning: reward functions are typically handcrafted from heuristic rules, which limits their generalizability across diverse disease contexts. To overcome this limitation, the study introduces an automated reward engineering framework that, for the first time in this domain, leverages large language models. The framework constructs a potential function from three clinically meaningful dimensions—survival, confidence, and competence—to generate reward signals. Prior to deployment, candidate reward structures are quantitatively evaluated and the best one is selected through a principled process. Empirical results demonstrate that this approach enables disease-specific yet generalizable reward design, significantly improving policy performance across multiple clinical scenarios and validating the effectiveness and broad applicability of automated reward generation and evaluation.
📝 Abstract
Reinforcement Learning (RL) offers a powerful framework for optimizing dynamic treatment regimes (DTRs). However, clinical RL is fundamentally bottlenecked by reward engineering: the challenge of defining signals that safely and effectively guide policy learning in complex, sparse offline environments. Existing approaches often rely on manual heuristics that fail to generalize across diverse pathologies. To address this, we propose an automated pipeline leveraging Large Language Models (LLMs) for offline reward design and verification. We formulate the reward function using potential functions consisting of three core components: survival, confidence, and competence. We further introduce quantitative metrics to rigorously evaluate and select the optimal reward structure prior to deployment. By integrating LLM-driven domain knowledge, our framework automates the design of reward functions for specific diseases while significantly enhancing the performance of the resulting policies.
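The potential-function formulation described above can be sketched as standard potential-based reward shaping, where the potential Φ(s) is a weighted combination of the three components. The component names, weights, and example scores below are illustrative assumptions, not the paper's actual implementation:

```python
# Sketch of potential-based reward shaping with a three-component potential.
# Assumes each component score is a scalar in [0, 1] derived from the patient
# state; in the paper these components are designed by an LLM.

def potential(state, weights=(1.0, 1.0, 1.0)):
    """Phi(s): weighted sum of survival, confidence, and competence scores."""
    w_surv, w_conf, w_comp = weights
    return (w_surv * state["survival"]
            + w_conf * state["confidence"]
            + w_comp * state["competence"])

def shaped_reward(r, s, s_next, gamma=0.99, weights=(1.0, 1.0, 1.0)):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).

    This additive form is known to preserve the optimal policy of the
    underlying MDP, which makes it a safe way to densify sparse rewards.
    """
    return r + gamma * potential(s_next, weights) - potential(s, weights)

# Example transition: the patient state improves on all three components,
# so the shaped reward adds a positive bonus to the sparse environment reward.
s = {"survival": 0.6, "confidence": 0.5, "competence": 0.4}
s_next = {"survival": 0.8, "confidence": 0.7, "competence": 0.6}
bonus = shaped_reward(0.0, s, s_next)
```

Using a potential difference rather than the raw component scores means the shaping term rewards *improvement* in the patient's condition, not merely being in a good state.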