AI Summary
Clinical decision-making requires balancing multiple diagnostic hypotheses, yet existing medical reasoning models (MRMs) produce only single-answer outputs, leading to narrow, potentially unsafe reasoning. This work is the first systematic study on training MRMs to generate ranked answer lists, enabling safer, more comprehensive diagnosis for open-ended clinical questions. We propose a novel ranking-aware reward function and integrate it into a unified training framework combining prompt engineering, supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT). Experiments demonstrate that RFT substantially improves model robustness across diverse response formats; the resulting ranked lists not only recover ground-truth diagnoses but also surface clinically plausible alternatives, thereby enhancing decision support. Our approach advances medical large language models toward a multi-hypothesis, interpretable, and safety-critical reasoning paradigm.
Abstract
This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer; it instead weighs multiple options, reducing the risks of a narrow perspective. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format, ranked lists, and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM's response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, whereas RFT incentivizes exploration by rewarding responses that maximize a reward function. We propose new reward functions targeted at ranked-list answer formats and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs may fail to select the benchmark's preferred ground truth, they can still recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists. We hope this work provides a first step toward developing answer formats that are beneficial beyond single answers in medical domains.
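The paper's actual reward functions for ranked-list answers are not reproduced here, but the core idea of a ranking-aware reward can be sketched with a standard reciprocal-rank score: if the ground-truth answer appears at position k of the model's ranked list, the reward is 1/k, and 0 if it is absent. The function name and the simple string normalization below are illustrative assumptions, not the authors' implementation.

```python
def reciprocal_rank_reward(ranked_answers, ground_truth):
    """Illustrative ranking-aware reward (not the paper's exact function).

    Returns 1/k if ground_truth matches the answer at 1-indexed position k
    of the model's ranked list, and 0.0 if it does not appear at all.
    """
    def norm(s):
        # Naive normalization; real medical QA matching would need
        # synonym/ontology handling (an assumption simplified away here).
        return s.strip().lower()

    target = norm(ground_truth)
    for k, answer in enumerate(ranked_answers, start=1):
        if norm(answer) == target:
            return 1.0 / k
    return 0.0
```

A reward shaped this way still credits a model that lists the benchmark's preferred answer second or third behind other clinically plausible options, which matches the paper's observation that MRMs can recognize valid answers even when they fail to rank the preferred ground truth first.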