🤖 AI Summary
To address insufficient personalization in medical interventions, this paper proposes an LLM-RL collaborative framework: a large language model (LLM) serves as an online action filter for reinforcement learning (RL) algorithms—specifically PPO or SAC—by dynamically parsing unstructured user textual preferences (e.g., health status, contraindications) and generating interpretable, context-aware intervention policies in real time. Methodologically, we introduce a novel multi-strategy prompting and action fusion mechanism that eliminates reliance on structured reward signals traditionally required by RL. Furthermore, we construct a simulation environment embedded with behavioral dynamics constraints to rigorously evaluate policy efficacy and safety. Experimental results demonstrate that our approach significantly enhances personalization and long-term intervention effectiveness—achieving an 18.7% increase in cumulative reward—while preserving clinical interpretability and enabling real-time adaptability to evolving patient conditions.
📝 Abstract
Reinforcement learning (RL) is increasingly being used in the healthcare domain, particularly for the development of personalized health adaptive interventions. Inspired by the success of Large Language Models (LLMs), we are interested in using LLMs to update the RL policy in real time, with the goal of accelerating personalization. We use the text-based user preference to influence the action selection on the fly, in order to immediately incorporate the user preference. We use the term "user preference" as a broad term to refer to a user's personal preference, constraint, health status, or a statement expressing like or dislike, etc. Our novel approach is a hybrid method that combines the LLM response and the RL action selection to improve the RL policy. Given an LLM prompt that incorporates the user preference, the LLM acts as a filter in the typical RL action selection. We investigate different prompting strategies and action selection strategies. To evaluate our approach, we implement a simulation environment that generates the text-based user preferences and models the constraints that impact behavioral dynamics. We show that our approach is able to take into account the text-based user preferences while improving the RL policy, thus improving personalization in adaptive interventions.
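The core idea of using the LLM as a filter in the RL action selection can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm_filter` function stands in for an actual LLM call, using a simple keyword rule in place of the model's judgment, and the action names, Q-values, and epsilon-greedy selection are illustrative assumptions.

```python
import random

def llm_filter(user_preference, candidate_actions):
    """Hypothetical stand-in for an LLM call: given free-text user
    preferences, return the subset of candidate actions it allows.
    A keyword rule mimics the LLM's judgment here."""
    if "no evening" in user_preference.lower():
        return [a for a in candidate_actions if a != "evening_reminder"]
    return list(candidate_actions)

def select_action(q_values, user_preference, epsilon=0.1, rng=random):
    """Epsilon-greedy RL action selection, restricted to the
    LLM-approved subset of actions."""
    allowed = llm_filter(user_preference, list(q_values))
    if not allowed:               # fall back if the filter removes everything
        allowed = list(q_values)
    if rng.random() < epsilon:    # explore, but only within the allowed set
        return rng.choice(allowed)
    return max(allowed, key=q_values.get)  # exploit: best allowed action

# Illustrative Q-values for three intervention actions
q = {"morning_reminder": 0.4, "evening_reminder": 0.9, "no_message": 0.1}
action = select_action(q, "Please, no evening notifications", epsilon=0.0)
# The greedy action ("evening_reminder") is filtered out, so the agent
# picks the best remaining action instead.
```

The design point this illustrates: the user's text constrains which actions are eligible on the current step, while the RL value estimates still decide among the eligible actions, so the policy keeps improving from reward feedback.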