AI Summary
Clinical decision-making requires balancing multiple diagnostic hypotheses, yet existing medical reasoning models (MRMs) produce only single-answer outputs, leading to narrow, potentially unsafe reasoning. This work is the first systematic study on training MRMs to generate ranked answer lists, enabling safer, more comprehensive diagnosis for open-ended clinical questions. We propose a novel ranking-aware reward function and integrate it into a unified training framework combining prompt engineering, supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT). Experiments demonstrate that RFT substantially improves model robustness across diverse response formats; the resulting ranked lists not only recover ground-truth diagnoses but also surface clinically plausible alternatives, thereby enhancing decision support. Our approach advances medical large language models toward a multi-hypothesis, interpretable, and safety-critical reasoning paradigm.
Abstract
This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer; it instead weighs multiple options, reducing the risks of a narrow perspective. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format, ranked lists, and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM's response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, whereas RFT incentivizes exploration by rewarding responses that maximize a reward function. We propose new reward functions targeted at ranked-list answer formats and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs may fail to select the benchmark's preferred ground truth, they can still recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists. We hope this work provides a first step toward developing answer formats that are beneficial beyond single answers in medical domains.
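The paper's actual reward functions for ranked-list answers are not reproduced here, but the core idea of a ranking-aware reward can be sketched with a standard reciprocal-rank score: if the ground-truth answer appears at position k of the model's ranked list, the reward is 1/k, and 0 if it is absent. The function name and the simple string normalization below are illustrative assumptions, not the authors' implementation.

```python
def reciprocal_rank_reward(ranked_answers, ground_truth):
    """Illustrative ranking-aware reward (not the paper's exact function).

    Returns 1/k if ground_truth matches the answer at 1-indexed position k
    of the model's ranked list, and 0.0 if it does not appear at all.
    """
    def norm(s):
        # Naive normalization; real medical QA matching would need
        # synonym/ontology handling (an assumption simplified away here).
        return s.strip().lower()

    target = norm(ground_truth)
    for k, answer in enumerate(ranked_answers, start=1):
        if norm(answer) == target:
            return 1.0 / k
    return 0.0
```

A reward shaped this way still credits a model that lists the benchmark's preferred answer second or third behind other clinically plausible options, which matches the paper's observation that MRMs can recognize valid answers even when they fail to rank the preferred ground truth first.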