RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

📅 2025-08-31
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current large medical language models suffer from frequent factual inaccuracies and low clinical reliability in their reasoning chains. To address this, we propose RPRO, a novel reinforcement learning framework for optimizing clinical reasoning chains. RPRO integrates task-adaptive reasoning templates, group-level Bradley-Terry preference ranking, and KL-divergence regularization, enabling automatic identification and correction of low-quality reasoning paths and substantially improving both factual accuracy and clinical consistency. Evaluated on the PubMedQA and MedQA-USMLE benchmarks, RPRO consistently outperforms strong baselines; notably, a compact 1.1B-parameter variant surpasses much larger 7B-13B models, including medical-specialized ones, demonstrating the method's efficiency, robustness, and scalability. RPRO thus establishes a principled, clinically grounded paradigm for reasoning-chain optimization in medical AI.

📝 Abstract
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
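To make the groupwise objective concrete, here is a hedged sketch reconstructed from the abstract; the paper's exact formulation may differ. Given K candidate reasoning chains y_1, ..., y_K for a question x, ranked by quality, a groupwise Bradley-Terry ranking in Plackett-Luce form, with KL regularization toward a reference policy, could read:

P(y_{(1)} \succ \cdots \succ y_{(K)} \mid x) = \prod_{k=1}^{K} \frac{\exp(s_\theta(x, y_{(k)}))}{\sum_{j=k}^{K} \exp(s_\theta(x, y_{(j)}))}

\mathcal{L}(\theta) = -\log P(y_{(1)} \succ \cdots \succ y_{(K)} \mid x) + \beta \, D_{\mathrm{KL}}(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x))

Here s_\theta is a scalar quality score, \pi_{\mathrm{ref}} a frozen reference model, and \beta the regularization weight; all symbols are illustrative assumptions. For K = 2 the product reduces to the classic pairwise Bradley-Terry probability \exp(s_1) / (\exp(s_1) + \exp(s_2)), which is what the abstract contrasts against.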
Problem

Research questions and friction points this paper is trying to address.

Enhancing clinical reasoning accuracy in medical QA
Correcting low-quality chains in diagnostic reasoning
Aligning LLM outputs with clinical workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with preference-driven refinement
Groupwise ranking optimization using the Bradley-Terry model (see the code sketch after this list)
Task-adaptive reasoning templates with probabilistic evaluation
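The groupwise ranking loss sketched above is straightforward to prototype. Below is a minimal PyTorch sketch under the same assumptions; the function name groupwise_bt_loss, the inputs (scores, logp_policy, logp_ref), and beta are hypothetical names for illustration, not identifiers from the paper.

import torch

def groupwise_bt_loss(scores, logp_policy, logp_ref, beta=0.1):
    # scores: (K,) policy quality scores for K chains, ordered best-to-worst.
    # logp_policy, logp_ref: (K,) sequence log-probabilities under the current
    # policy and a frozen reference model, used for the KL penalty.
    K = scores.shape[0]
    # Plackett-Luce negative log-likelihood of the observed ranking:
    # -sum_k [ s_k - logsumexp(s_k, ..., s_K) ]
    nll = torch.zeros((), dtype=scores.dtype)
    for k in range(K):
        nll = nll - (scores[k] - torch.logsumexp(scores[k:], dim=0))
    # Monte-Carlo estimate of KL(policy || reference) over the sampled chains.
    kl = (logp_policy - logp_ref).mean()
    return nll + beta * kl

# Toy usage with four ranked chains.
scores = torch.tensor([2.1, 1.3, 0.4, -0.5], requires_grad=True)
lp_pol = torch.tensor([-12.0, -15.0, -14.0, -18.0])
lp_ref = torch.tensor([-13.0, -15.5, -13.5, -17.0])
loss = groupwise_bt_loss(scores, lp_pol, lp_ref, beta=0.1)
loss.backward()  # gradients flow into the scores

The KL term keeps the updated policy close to the reference model, which is the standard stabilization mechanism in KL-regularized preference optimization.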
Chia-Hsuan Hsu
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan; Far Eastern Memorial Hospital, New Taipei, Taiwan
Jun-En Ding
Stevens Institute of Technology
AI for Healthcare · Multimodal Learning · Electronic Health Records · Computational Neuroscience
Hsin-Ling Hsu
National Chengchi University
Information Retrieval · Natural Language Processing · AI for Healthcare · Trustworthy AI
Feng Liu
Department of Systems Engineering, Stevens Institute of Technology, New Jersey, USA
Fang-Ming Hung
Far Eastern Memorial Hospital, New Taipei, Taiwan