🤖 AI Summary
This work addresses the paradigm mismatch in current medical question-answering alignment methods: human preference annotations are costly and fail to capture the absolute correctness of medical facts, verifiable rewards suffer from the absence of effective automatic validators, and multi-objective heterogeneous rewards often lead to scale mismatches and optimization conflicts. To resolve this, the authors propose a multidimensional medical alignment matrix that decomposes alignment objectives into four categories—core capabilities, expert knowledge, online feedback, and formatting compliance—and generates fine-grained reward signals from observable metrics under a unified optimization framework. Key innovations include a closed-loop diagnosis-reward mechanism, Reference-Frozen Normalization for consistent reward scaling, and a Tri-Factor Adaptive Dynamic Weighting strategy that enables weakness-aware, risk-prioritized co-optimization. Experiments demonstrate significant improvements in correctness, safety, and compliance, establishing a new paradigm for complex alignment tasks in specialized domains.
📝 Abstract
While reinforcement learning for large language model alignment has progressed rapidly in recent years, transferring these paradigms to high-stakes medical question answering reveals a fundamental paradigm mismatch. Reinforcement Learning from Human Feedback relies on preference annotations that are prohibitively expensive and often fail to reflect the absolute correctness of medical facts. Reinforcement Learning from Verifiable Rewards lacks effective automatic verifiers and struggles to handle complex clinical contexts. Meanwhile, medical alignment requires the simultaneous optimization of correctness, safety, and compliance, yet multi-objective heterogeneous reward signals are prone to scale mismatch and optimization conflicts. To address these challenges, we propose a robust medical alignment paradigm. We first construct a holistic multi-dimensional medical alignment matrix that decomposes alignment objectives into four categories: fundamental capabilities, expert knowledge, online feedback, and format specifications. Within each category, we establish a closed loop in which observable metrics inform attributable diagnosis, which in turn drives optimizable rewards, thereby providing fine-grained, high-resolution supervision signals for subsequent iterative optimization. To resolve the gradient domination and optimization instability problems caused by heterogeneous signals, we further propose a unified optimization mechanism. This mechanism employs Reference-Frozen Normalization to align reward scales and implements a Tri-Factor Adaptive Dynamic Weighting strategy to achieve collaborative optimization that is weakness-oriented, risk-prioritized, and redundancy-reducing. Experimental results demonstrate the effectiveness of our proposed paradigm in real-world medical scenario evaluations, establishing a new paradigm for complex alignment in vertical domains.
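The abstract does not give formulas for the two mechanisms, but their descriptions suggest a natural shape: normalize each heterogeneous reward against statistics from a frozen reference set (so scales stay comparable as the policy drifts), then combine per-dimension weights from three factors — current weakness, risk priority, and redundancy. The sketch below is a minimal illustration of that reading; all function names, the sigmoid squashing, and the multiplicative combination of factors are assumptions, not the paper's actual method.

```python
import numpy as np

def reference_frozen_normalize(rewards, ref_mean, ref_std, eps=1e-8):
    # Normalize each reward dimension against statistics computed once
    # on a frozen reference set, so reward scales stay comparable
    # across heterogeneous objectives and across training steps.
    return (rewards - ref_mean) / (ref_std + eps)

def tri_factor_weights(scores, risk, redundancy, eps=1e-8):
    # Weakness-oriented: dimensions with low current scores get more weight.
    # Risk-prioritized: each dimension is scaled by its risk priority.
    # Redundancy-reducing: dimensions overlapping with others are discounted.
    weakness = 1.0 - scores
    raw = weakness * risk * (1.0 - redundancy)
    return raw / (raw.sum() + eps)

# Toy example with the four objective categories from the matrix.
ref_mean = np.array([0.6, 0.7, 0.5, 0.9])   # frozen reference means
ref_std  = np.array([0.2, 0.1, 0.3, 0.05])  # frozen reference stds
rewards  = np.array([0.55, 0.75, 0.40, 0.92])

z = reference_frozen_normalize(rewards, ref_mean, ref_std)
scores = 1.0 / (1.0 + np.exp(-z))            # squash to (0, 1)
risk = np.array([0.9, 1.0, 0.8, 0.3])        # e.g. safety weighted high
redund = np.array([0.1, 0.2, 0.1, 0.5])      # e.g. format overlaps others
w = tri_factor_weights(scores, risk, redund)
scalar_reward = float(w @ scores)            # unified training signal
```

In this toy setting the low-risk, high-redundancy format dimension receives the smallest weight, while under-performing high-risk dimensions dominate — the "weakness-oriented, risk-prioritized, redundancy-reducing" behavior the abstract describes.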