🤖 AI Summary
This study identifies an "alignment paradox" in medical large language models (LLMs) for infertility diagnosis and treatment: improved algorithmic accuracy, e.g., via Group Relative Policy Optimization (GRPO), does not necessarily enhance clinical decision quality. Leveraging over 8,000 real-world infertility cases, the authors systematically compare four alignment methods, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), GRPO, and In-Context Learning (ICL), within a dual-layer evaluation framework combining automated metrics with blinded clinician assessments. Results show that GRPO achieves the highest technical performance, yet the SFT model attains the highest clinician win rate (51.2%), ahead of both GRPO (26.2%) and the physicians' original decisions (22.7%). This work provides the first empirical evidence that clinical interpretability and therapeutic feasibility matter more than raw predictive accuracy, challenging prevailing alignment paradigms and proposing a clinically grounded evaluation standard centered on real-world medical value.
📝 Abstract
Large language models (LLMs) are increasingly adopted in clinical decision support, yet aligning them with the multifaceted reasoning pathways of real-world medicine remains a major challenge. Using more than 8,000 infertility treatment records, we systematically evaluate four alignment strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL). Evaluation is conducted through a dual-layer framework that combines automatic benchmarks with blinded doctor-in-the-loop assessments. GRPO achieves the highest algorithmic accuracy across multiple decision layers, confirming the value of reinforcement-based optimization for structured prediction tasks. However, clinicians consistently prefer the SFT model, citing clearer reasoning processes (p = 0.035) and higher therapeutic feasibility (p = 0.019). In blinded pairwise comparisons, SFT attains the highest win rate (51.2%), outperforming both GRPO (26.2%) and the physicians' original decisions (22.7%). These results reveal an alignment paradox: algorithmic improvements do not necessarily translate into greater clinical trust and may diverge from human-centered preferences. Our findings highlight the need for alignment strategies that prioritize clinically interpretable and practically feasible reasoning rather than solely optimizing decision-level accuracy.