MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

📅 2026-03-13
📝 Abstract
Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a disconnect between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that lets clinicians directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Of the 1,571 preferences collected across 12 LLMs through November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual-recall tasks (e.g., MedQA); the majority addressed topics such as treatment selection, clinical documentation, or patient communication, and roughly 20% involved multi-turn conversations. Additionally, clinicians cited depth and detail, as well as clarity of presentation, more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable after controlling for style-related factors such as response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.
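The abstract ranks models with Bradley-Terry ratings fitted to pairwise clinician preferences. As a minimal illustrative sketch (not the paper's actual implementation; the function and the model labels below are hypothetical), the classic Zermelo/MM fixed-point iteration can recover such ratings from (winner, loser) pairs:

```python
import math
from collections import defaultdict

def bradley_terry(pairs, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the
    Zermelo/MM iteration; returns ratings on a log scale."""
    models = sorted({m for pair in pairs for m in pair})
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # head-to-head comparison counts
    for winner, loser in pairs:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    p = {m: 1.0 for m in models}  # initial strengths
    for _ in range(iters):
        # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = wins[i] / denom if denom else p[i]
        # Renormalize so the overall scale stays fixed across iterations.
        total = sum(new_p.values())
        p = {m: v * len(models) / total for m, v in new_p.items()}
    return {m: math.log(v) for m, v in p.items()}

# Toy example with three hypothetical models: "A" wins most of its
# comparisons, "C" wins fewest, so the ratings should order A > B > C.
pairs = ([("A", "B")] * 8 + [("B", "A")] * 2 +
         [("A", "C")] * 7 + [("C", "A")] * 3 +
         [("B", "C")] * 6 + [("C", "B")] * 4)
ratings = bradley_terry(pairs)
```

Arena-style leaderboards often fit the same model via logistic regression with bootstrapped confidence intervals; this fixed-point version is chosen here only because it needs nothing beyond the standard library.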
👥 Authors
Eric Wu, Stanford University (Biomedical AI)
Kevin Wu, Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Jason Hom, Division of Hospital Medicine, Department of Medicine, Stanford School of Medicine, Stanford, CA, USA
Paul H. Yi, St. Jude Children's Research Hospital (Radiology, Artificial Intelligence)
Angela Zhang, University of California, San Francisco, San Francisco, CA, USA
Alejandro Lozano, Stanford University (Foundation Models, Multimodal Learning, Retrieval Augmentation)
Jeff Nirschl, Department of Pathology and Laboratory Medicine, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA
Jeff Tangney, Doximity, San Francisco, CA, USA
Kevin Byram, Department of Medicine, Division of Rheumatology and Immunology, Vanderbilt University Medical Center, Nashville, TN, USA
Braydon Dymm, Department of Neurology, Charleston Area Medical Center, Charleston, WV, USA
Narender Annapureddy, Department of Medicine, Division of Rheumatology and Immunology, Vanderbilt University Medical Center, Nashville, TN, USA
Eric Topol, Professor and EVP, Scripps Research (AI, genomics, digital and individualized medicine)
David Ouyang, Cardiology, Kaiser Permanente (Computer Vision, Machine Learning, Echocardiography, Cardiology, Data Science)
James Zou, Stanford University (machine learning, computational biology, computational health, statistics, biotech)