Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

📅 2025-09-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) exhibit low accuracy and opaque reasoning processes in medical inference. Method: This paper proposes a verifiable training framework for expert-level medical reasoning. It introduces a reasoning-oriented data construction strategy that integrates knowledge-graph-guided data synthesis with chain-of-thought distillation. A two-stage Reinforcement Learning from Verifiable Rewards (RLVR) paradigm, built on Group Relative Policy Optimization (GRPO), first consolidates core reasoning skills and then targets persistent failure modes—such as multi-hop and rare-disease reasoning—through adaptive hard-sample mining. Contribution/Results: Experimental results show that the 7B model significantly outperforms larger open-source baselines, while the 32B variant approaches GPT-4o's performance on mainstream medical benchmarks. The framework achieves high accuracy, strong interpretability, and efficient parameter utilization, demonstrating unified verifiability and robustness for clinical-grade medical reasoning.

📝 Abstract
While large language models show promise in medical applications, achieving expert-level clinical reasoning remains challenging due to the need for both accurate answers and transparent reasoning processes. To address this challenge, we introduce Fleming-R1, a model designed for verifiable medical reasoning through three complementary innovations. First, our Reasoning-Oriented Data Strategy (RODS) combines curated medical QA datasets with knowledge-graph-guided synthesis to improve coverage of underrepresented diseases, drugs, and multi-hop reasoning chains. Second, we employ Chain-of-Thought (CoT) cold start to distill high-quality reasoning trajectories from teacher models, establishing robust inference priors. Third, we implement a two-stage Reinforcement Learning from Verifiable Rewards (RLVR) framework using Group Relative Policy Optimization, which consolidates core reasoning skills while targeting persistent failure modes through adaptive hard-sample mining. Across diverse medical benchmarks, Fleming-R1 delivers substantial parameter-efficient improvements: the 7B variant surpasses much larger baselines, while the 32B model achieves near-parity with GPT-4o and consistently outperforms strong open-source alternatives. These results demonstrate that structured data design, reasoning-oriented initialization, and verifiable reinforcement learning can advance clinical reasoning beyond simple accuracy optimization. We release Fleming-R1 publicly to promote transparent, reproducible, and auditable progress in medical AI, enabling safer deployment in high-stakes clinical environments.
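The abstract's central training mechanism, Group Relative Policy Optimization within an RLVR framework, can be illustrated with a minimal sketch: sample a group of responses per prompt, score each with a verifiable reward, and normalize rewards within the group to get advantages, so no learned value network is needed. The function and reward values below are illustrative assumptions, not the paper's implementation.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - mean) / std over one prompt's group.

    In GRPO, each sampled response is scored against the rest of its group,
    so correct answers get positive advantage and incorrect ones negative.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one question; a verifiable reward
# gives 1.0 for a correct final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_advantages(rewards)
```

These per-response advantages then weight the policy-gradient update; responses an external verifier marks correct are reinforced relative to their group, which is what makes the reward signal auditable.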
Problem

Research questions and friction points this paper is trying to address.

Achieving expert-level clinical reasoning with transparency
Improving coverage of underrepresented diseases and drugs
Targeting persistent failure modes via adaptive hard-sample mining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-graph-guided synthesis for disease coverage
Chain-of-Thought cold start from teacher models
Reinforcement learning with verifiable reward optimization
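The "verifiable reward" in the last bullet means the reward comes from programmatically checking the model's final answer against a gold label, rather than from a learned reward model. A minimal sketch for multiple-choice medical QA, assuming the model is prompted to end its response with "Answer: <option>" (the prompt convention and function name are assumptions, not the paper's code):

```python
import re

def verifiable_reward(response: str, gold: str) -> float:
    """Return 1.0 if the response's final stated option matches the gold label."""
    # Assumed output convention: the model ends with "Answer: <A-E>".
    m = re.search(r"Answer:\s*([A-Ea-e])\b", response)
    if m is None:
        return 0.0  # no parseable answer -> no reward
    return 1.0 if m.group(1).upper() == gold.upper() else 0.0
```

Because the check is a deterministic string match, every reward assigned during training can be re-verified after the fact, which is the property the abstract calls "auditable".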
👥 Authors
Chi Liu — Ubiquant
Derek Li — Ubiquant
Yan Shu — University of Trento (previously Harbin Institute of Technology)
Vision and Language · Multi-modal Learning · Video Understanding · OCR · Remote Sensing
Robin Chen — Ubiquant
Derek Duan — Ubiquant
Teng Fang — Ubiquant
Bryan Dai — Ubiquant