RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

📅 2025-08-31
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current large medical language models suffer from frequent factual inaccuracies and low clinical reliability in their reasoning chains. To address this, we propose RPRO, a novel reinforcement learning framework for optimizing clinical reasoning chains. RPRO integrates task-adaptive reasoning templates, group-level Bradley-Terry preference ranking, and KL-divergence regularization, enabling automatic identification and correction of low-quality reasoning paths and substantially improving both factual accuracy and clinical consistency. Evaluated on the PubMedQA and MedQA-USMLE benchmarks, RPRO consistently outperforms strong baselines; notably, a compact 1.1B-parameter variant surpasses much larger 7B-13B models, including medical-specialized ones, demonstrating the method's efficiency, robustness, and scalability. RPRO thus establishes a principled, clinically grounded paradigm for reasoning-chain optimization in medical AI.

📝 Abstract
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
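To make the groupwise objective concrete, here is a hedged sketch reconstructed from the abstract; the paper's exact formulation may differ. Given K candidate reasoning chains y_1, ..., y_K for a question x, ranked by quality, a groupwise Bradley-Terry ranking in Plackett-Luce form, with KL regularization toward a reference policy, could read:

P(y_{(1)} \succ \cdots \succ y_{(K)} \mid x) = \prod_{k=1}^{K} \frac{\exp(s_\theta(x, y_{(k)}))}{\sum_{j=k}^{K} \exp(s_\theta(x, y_{(j)}))}

\mathcal{L}(\theta) = -\log P(y_{(1)} \succ \cdots \succ y_{(K)} \mid x) + \beta \, D_{\mathrm{KL}}(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x))

Here s_\theta is a scalar quality score, \pi_{\mathrm{ref}} a frozen reference model, and \beta the regularization weight; all symbols are illustrative assumptions. For K = 2 the product reduces to the classic pairwise Bradley-Terry probability \exp(s_1) / (\exp(s_1) + \exp(s_2)), which is what the abstract contrasts against.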
Problem

Research questions and friction points this paper is trying to address.

Enhancing clinical reasoning accuracy in medical QA
Correcting low-quality chains in diagnostic reasoning
Aligning LLM outputs with clinical workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with preference-driven refinement
Groupwise ranking optimization using the Bradley-Terry model (see the code sketch after this list)
Task-adaptive reasoning templates with probabilistic evaluation
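The groupwise ranking loss sketched above is straightforward to prototype. Below is a minimal PyTorch sketch under the same assumptions; the function name groupwise_bt_loss, the inputs (scores, logp_policy, logp_ref), and beta are hypothetical names for illustration, not identifiers from the paper.

import torch

def groupwise_bt_loss(scores, logp_policy, logp_ref, beta=0.1):
    # scores: (K,) policy quality scores for K chains, ordered best-to-worst.
    # logp_policy, logp_ref: (K,) sequence log-probabilities under the current
    # policy and a frozen reference model, used for the KL penalty.
    K = scores.shape[0]
    # Plackett-Luce negative log-likelihood of the observed ranking:
    # -sum_k [ s_k - logsumexp(s_k, ..., s_K) ]
    nll = torch.zeros((), dtype=scores.dtype)
    for k in range(K):
        nll = nll - (scores[k] - torch.logsumexp(scores[k:], dim=0))
    # Monte-Carlo estimate of KL(policy || reference) over the sampled chains.
    kl = (logp_policy - logp_ref).mean()
    return nll + beta * kl

# Toy usage with four ranked chains.
scores = torch.tensor([2.1, 1.3, 0.4, -0.5], requires_grad=True)
lp_pol = torch.tensor([-12.0, -15.0, -14.0, -18.0])
lp_ref = torch.tensor([-13.0, -15.5, -13.5, -17.0])
loss = groupwise_bt_loss(scores, lp_pol, lp_ref, beta=0.1)
loss.backward()  # gradients flow into the scores

The KL term keeps the updated policy close to the reference model, which is the standard stabilization mechanism in KL-regularized preference optimization.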
Chia-Hsuan Hsu
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan; Far Eastern Memorial Hospital, New Taipei, Taiwan
Jun-En Ding
Stevens Institute of Technology
AI for Healthcare · Multimodal Learning · Electronic Health Records · Computational Neuroscience
Hsin-Ling Hsu
National Chengchi University
Information Retrieval · Natural Language Processing · AI for Healthcare · Trustworthy AI
Feng Liu
Department of Systems Engineering, Stevens Institute of Technology, New Jersey, USA
Fang-Ming Hung
Far Eastern Memorial Hospital, New Taipei, Taiwan