🤖 AI Summary
This work addresses the limitations of existing medical large language model alignment methods, which rely on coarse-grained preference signals and fail to meet the multidimensional, high-precision alignment demands of clinical protocols. To overcome this, the authors propose a unified alignment framework grounded in fine-grained clinical criteria: they construct a dataset annotated with expert-defined scoring rules through a human-AI collaborative pipeline, train a multidimensional reward model using an explicit criteria-injection paradigm, and decouple safety constraints from general capabilities to guide GRPO-based reinforcement learning. The study also introduces ProMedical-Bench, an independent evaluation benchmark assessed via double-blind expert review. Evaluated on Qwen3-8B, the approach achieves a 22.3% improvement in overall accuracy and a 21.7% gain in safety compliance, while performing comparably to state-of-the-art models on external benchmarks such as UltraMedical.
📝 Abstract
Aligning Large Language Models (LLMs) with high-stakes medical standards remains a significant challenge, primarily due to the dissonance between coarse-grained preference signals and the complex, multi-dimensional nature of clinical protocols. To bridge this gap, we introduce ProMedical, a unified alignment framework grounded in fine-grained clinical criteria. We first construct ProMedical-Preference-50k, a dataset generated via a human-in-the-loop pipeline that augments medical instructions with rigorous, physician-derived rubrics. Leveraging this corpus, we propose the Explicit Criteria Injection paradigm to train a multi-dimensional reward model. Unlike traditional scalar reward models, our approach explicitly disentangles safety constraints from general proficiency, enabling precise guidance during reinforcement learning. To rigorously validate this framework, we establish ProMedical-Bench, a held-out evaluation suite anchored by double-blind expert adjudication. Empirical evaluations demonstrate that optimizing the Qwen3-8B base model via ProMedical-RM-guided GRPO yields substantial gains, improving overall accuracy by 22.3% and safety compliance by 21.7%, effectively rivaling proprietary frontier models. Furthermore, the aligned policy generalizes robustly to external benchmarks, demonstrating performance comparable to state-of-the-art models on UltraMedical. We publicly release our datasets, reward models, and benchmarks to facilitate reproducible research in safety-aware medical alignment.
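The core mechanism described above, a reward model that scores responses against fine-grained clinical criteria while keeping safety decoupled from general proficiency, can be sketched in miniature. The snippet below is an illustrative assumption, not the released ProMedical implementation: all names (`CriterionScore`, `decoupled_reward`, `grpo_advantages`, the weight and floor values) are hypothetical, and real GRPO advantages would be computed over sampled groups during RL training.

```python
# Illustrative sketch (NOT the ProMedical codebase): a rubric-based reward with
# a decoupled safety gate, feeding group-relative advantages in the GRPO style.
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class CriterionScore:
    name: str          # hypothetical rubric item, e.g. "contraindication check"
    is_safety: bool    # safety-critical criteria are tracked separately
    score: float       # rubric score in [0, 1]

def decoupled_reward(scores, safety_weight=2.0, violation_floor=0.5):
    """Combine per-criterion rubric scores into one scalar reward.

    Safety criteria are decoupled from general proficiency: if any safety
    criterion falls below `violation_floor`, the reward is strictly negative,
    so fluency or helpfulness gains cannot mask a safety failure.
    """
    safety = [s.score for s in scores if s.is_safety]
    general = [s.score for s in scores if not s.is_safety]
    safety_term = min(safety) if safety else 1.0
    general_term = mean(general) if general else 0.0
    if safety_term < violation_floor:       # hard safety gate
        return safety_term - 1.0            # always < 0: violation dominates
    return general_term + safety_weight * safety_term

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within one prompt's
    group of sampled responses (mean 0, unit variance)."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

Under these assumptions, a response that scores well on clarity but fails a safety criterion receives a negative reward and is pushed down relative to its group, which is the intended effect of separating safety constraints from general capability signals.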