Aligning Medical Conversational AI through Online Reinforcement Learning with Information-Theoretic Rewards

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Information Gain Fine-Tuning (IGFT), a method for training medical dialogue models to conduct efficient multi-turn interviews and generate a comprehensive History of Present Illness (HPI) without relying on human-annotated conversational data. IGFT integrates online Group Relative Policy Optimization with a reward function based on clinical entity information gain, and leverages GPT-4o-mini to evaluate the clinical relevance, patient engagement, and specificity of model-generated questions, thereby guiding the model to actively elicit diagnostically critical information. Using LoRA to fine-tune Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Qwen-7B, IGFT achieves F1 score improvements of 10.9% and 12.9% on the Avey and MIMIC datasets, respectively, significantly outperforming OpenAI models and specialized medical baselines such as HuatuoGPT and UltraMedical.

📝 Abstract
We present Information Gain Fine-Tuning (IGFT), a novel approach for training medical conversational AI to conduct effective patient interviews and generate comprehensive History of Present Illness (HPI) without requiring pre-collected human conversations. IGFT combines online Group Relative Policy Optimization (GRPO) with information-theoretic rewards, enabling models to learn from self-generated conversations with simulated patients. Unlike existing approaches that rely on expensive expert-annotated conversations or static datasets, our online RL framework allows models to discover effective questioning strategies through exploration. Our key innovation is an information gain reward function that tracks which clinical entities (such as symptoms, temporal patterns, and medical history) are revealed during conversation. Each question's reward is computed from its expected information gain combined with GPT-4o-mini quality assessments across dimensions including clinical relevance, patient engagement, and specificity. This hybrid approach ensures models learn to ask targeted, clinically appropriate questions that efficiently gather diagnostic information. We fine-tune two models using LoRA: Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Qwen-7B (a reasoning-optimized model). Training exclusively on Avey data containing concise HPIs, we evaluate generalization to MIMIC data with longer, more elaborate HPIs. DeepSeek-R1-Distill-Qwen-7B (IGFT) achieves F1 scores of 0.408 on Avey (10.9% improvement over base) and 0.289 on MIMIC (12.9% improvement), while Llama-3.1-8B-Instruct (IGFT) reaches 0.384 and 0.336, respectively. Both models outperform OpenAI's model on MIMIC and surpass medical domain-specific baselines like HuatuoGPT and UltraMedical, which were optimized for single-turn medical QA rather than multi-turn conversations.
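The abstract's hybrid reward (clinical-entity information gain plus LLM quality judgments) and GRPO's group-relative advantage can be sketched as follows. This is a minimal illustration under stated assumptions: the entity sets, the `quality_scores` dictionary (standing in for GPT-4o-mini's judgments), and the mixing weight `alpha` are hypothetical, not the paper's exact formulation.

```python
def entity_information_gain(revealed_before, revealed_after):
    """Count clinical entities newly revealed by a question (the paper's
    information-gain signal over symptoms, temporal patterns, history, etc.)."""
    return len(set(revealed_after) - set(revealed_before))

def hybrid_reward(new_entities, quality_scores, alpha=0.5):
    """Blend entity gain with averaged LLM quality judgments
    (clinical relevance, patient engagement, specificity).
    `alpha` is an assumed mixing weight, not from the paper."""
    quality = sum(quality_scores.values()) / len(quality_scores)
    return alpha * new_entities + (1 - alpha) * quality

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled question's reward
    against the mean and std of its own group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for identical rewards
    return [(r - mean) / std for r in rewards]
```

For example, a question that elicits two new entities ("cough", onset "3 days ago") with middling quality scores earns a higher reward than a vague question that elicits none, and within each GRPO group only the relatively better questions receive positive advantage.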
Problem

Research questions and friction points this paper is trying to address.

Medical Conversational AI
History of Present Illness
Patient Interviewing
Online Reinforcement Learning
Information Gain
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information Gain Fine-Tuning
Online Reinforcement Learning
Medical Conversational AI
Information-Theoretic Rewards
History of Present Illness
Tanvi Verma
Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore
Yang Zhou
Huazhong University of Science and Technology
Photovoltaics, Halide segregation, Defect activity, Charge carrier dynamics, Imaging
Rick Siow Mong Goh
Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore
Yong Liu
Senior Research Scientist, National Center for Supercomputing Applications
Cloud Computing, Linked Data, Sensor Web, Geospatial Visual Analytics, Mobile computing