Baichuan-M2: Scaling Medical Capability with Large Verifier System

2025-09-02
AI Summary
Problem: Existing evaluations of medical large language models (LLMs) rely on static examinations (e.g., USMLE) and fail to capture the dynamic, interactive decision-making that real-world clinical practice demands. Method: We propose a dynamic verification framework for medical decision-making that couples a high-fidelity Patient Simulator with a Clinical Rubrics Generator to form a closed-loop interactive environment. On top of this multi-dimensional dynamic assessment, we train Baichuan-M2, a 32B-parameter medical reasoning model, through large-scale, multi-stage interactive reinforcement learning with an improved Group Relative Policy Optimization (GRPO) algorithm. Contribution/Results: On HealthBench, Baichuan-M2 outperforms all other open-source models and most closed-source models. Its score on the HealthBench Hard subset exceeds 32, a level previously exceeded only by GPT-5, establishing a new Pareto front in the performance-parameter trade-off for medical LLMs.

๐Ÿ“ Abstract
As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifier, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark-previously exceeded only by GPT-5. Our work demonstrates that robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
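The abstract says the Clinical Rubrics Generator produces multi-dimensional evaluation metrics, but this page does not say how those metrics become a training signal. The sketch below shows one plausible reduction: weighted rubric items are collapsed into a scalar reward in [0, 1]. The `RubricItem` type, the `rubric_reward` function, the point weights, and the clipping rule are all illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RubricItem:
    """One hypothetical rubric criterion (illustrative, not from the paper)."""
    description: str
    points: float  # positive = desired behaviour, negative = penalised behaviour


def rubric_reward(items: Sequence[RubricItem], met: Sequence[bool]) -> float:
    """Collapse per-criterion judgements into a single reward in [0, 1].

    Assumed rule: sum the points of every criterion judged as met, divide by
    the total of the positive points, then clip to the [0, 1] range.
    """
    if len(items) != len(met):
        raise ValueError("items and met must have the same length")
    max_points = sum(item.points for item in items if item.points > 0)
    if max_points == 0:
        return 0.0  # no positive criteria, so nothing can be earned
    earned = sum(item.points for item, ok in zip(items, met) if ok)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for a single simulated consultation.
rubric = [
    RubricItem("Asks about symptom onset and duration", 5.0),
    RubricItem("Recommends an appropriate follow-up test", 3.0),
    RubricItem("Gives a definitive diagnosis without enough evidence", -4.0),
]
print(rubric_reward(rubric, [True, True, False]))
```

In a setup like this, a separate judge (for example another model) would supply the `met` flags for each dialogue. Keeping the score in a fixed range makes rewards comparable across cases that have different numbers of criteria.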
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between medical LLM benchmark scores and real-world clinical utility
Capturing the dynamic, interactive nature of consultations that traditional medical exams miss
Developing a verifier system that aligns LLM capabilities with practical clinical decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic verification framework for clinical decision-making
Patient Simulator using de-identified medical records
Multi-stage reinforcement learning with an improved Group Relative Policy Optimization (GRPO) algorithm
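For readers unfamiliar with GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative advantage in its standard form. It does not reproduce the paper's improved variant, whose modifications are not described on this page. The function name, the use of population standard deviation, and the `eps` value are illustrative assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardise rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their group's average get a positive advantage.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifier rewards for four sampled responses to one patient case.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal. This is one reason a fine-grained verifier matters in this kind of training.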
Authors
Baichuan-M2 Team
Chengfeng Dou
Chong Liu
Fan Yang
Fei Li
Jiyuan Jia
Mingyang Chen (Baichuan Inc., Zhejiang University, The University of Edinburgh; Large Language Model, Reinforcement Learning, Knowledge Graph)
Qiang Ju
Shuai Wang
Shunya Dang
Tianpeng Li
Xiangrong Zeng
Yijie Zhou (The Chinese University of Hong Kong, Shenzhen; Distributed Optimization, Privacy Preserving)
Chenzheng Zhu
Da Pan
Fei Deng (Research Scientist, Google; Diffusion Models, RLHF, Reinforcement Learning, Generative Models, Object-Centric Learning)
Guangwei Ai
Guosheng Dong
Hongda Zhang
Jinyang Tai
Jixiang Hong
Kai Lu
Linzhuang Sun (University of Chinese Academy of Sciences; Multimodal Reasoning)
Peidong Guo