Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

2026-05-02
AI Summary
This study addresses three limitations in evaluating medical large language models (LLMs): benchmark saturation, restricted data access, and insufficient task coverage. It introduces Medmarks, a fully open-source evaluation suite of 30 benchmarks covering question answering, information extraction, medical calculation, and open-ended clinical reasoning. The authors evaluate 61 models across 71 configurations using verifiable metrics together with LLM-as-a-Judge scoring; a subset of the evaluations (Medmarks-T) can also serve as reinforcement learning environments for post-training medical reasoning. The results show that frontier reasoning models achieve the strongest performance, medically fine-tuned models outperform their generalist counterparts, most frontier proprietary models are significantly more token efficient than open-weight alternatives, and models are susceptible to answer-order bias, particularly smaller models and Grok 4.
Abstract
Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites are either saturated, depend heavily on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks
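The answer-order bias mentioned in the abstract can be illustrated with a small probe: score a model on multiple-choice questions in their original option order, score it again after shuffling the options, and compare the two accuracies. The sketch below is only an illustration under stated assumptions and is not the Medmarks implementation. The names `shuffle_options`, `order_bias_gap`, `always_first`, and `oracle`, the two toy questions, and the idea of modelling an LLM as a callable that returns an option index are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (question stem, answer options, index of the correct option)
Question = Tuple[str, Sequence[str], int]
# Assumption: a "model" is any callable that returns the index of the option it picks.
Model = Callable[[str, Sequence[str]], int]


def shuffle_options(
    options: Sequence[str], answer_idx: int, rng: random.Random
) -> Tuple[List[str], int]:
    """Shuffle the options and return them with the new index of the correct answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)


def order_bias_gap(
    questions: Sequence[Question], model: Model, n_shuffles: int = 4, seed: int = 0
) -> float:
    """Accuracy with the original option order minus mean accuracy over shuffled orders."""
    if not questions or n_shuffles < 1:
        raise ValueError("need at least one question and one shuffle")
    rng = random.Random(seed)
    original_hits = 0
    shuffled_hits = 0
    for stem, options, answer_idx in questions:
        original_hits += int(model(stem, options) == answer_idx)
        for _ in range(n_shuffles):
            shuffled, new_idx = shuffle_options(options, answer_idx, rng)
            shuffled_hits += int(model(stem, shuffled) == new_idx)
    n = len(questions)
    return original_hits / n - shuffled_hits / (n * n_shuffles)


# Toy data and toy models (hypothetical stand-ins for a benchmark and an LLM).
QUESTIONS: List[Question] = [
    ("Which vitamin deficiency causes scurvy?",
     ["Vitamin C", "Vitamin D", "Vitamin K", "Vitamin A"], 0),
    ("Which organ produces insulin?",
     ["Pancreas", "Liver", "Kidney", "Spleen"], 0),
]
ANSWER_KEY = {stem: options[idx] for stem, options, idx in QUESTIONS}


def always_first(stem: str, options: Sequence[str]) -> int:
    """A maximally position-biased model: always picks the first option."""
    return 0


def oracle(stem: str, options: Sequence[str]) -> int:
    """An order-invariant model: finds the correct text wherever it is placed."""
    return list(options).index(ANSWER_KEY[stem])


if __name__ == "__main__":
    print("oracle gap:", order_bias_gap(QUESTIONS, oracle))
    print("always_first gap:", order_bias_gap(QUESTIONS, always_first))
```

A gap near zero suggests the model's answers do not depend on option position, while a large positive gap suggests it benefits from where the correct answer happens to sit in the original ordering. How the paper itself quantifies this bias is not specified in the abstract.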
Problem

Research questions and friction points this paper is trying to address.

LLM benchmark
medical tasks
benchmark saturation
data accessibility
task coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-source benchmark
medical LLM evaluation
LLM-as-a-Judge
token efficiency
reinforcement learning environment
Authors
Benjamin Warner
Answer.AI
Ratna Sagari Grandhi
MedARC
Max Kieffer
MedARC
Aymane Ouraq
MedARC
Saurav Panigrahi
MedARC
Geetu Ambwani
MedARC
Kunal Bagga
MedARC
Nikhil Khandekar
MedARC
Arya Hariharan
MedARC
Nishant Mishra
MedARC
Manish Ram
MedARC
Shamus Sim Zi Yang
MedARC
Ahmed Essouaied
MedARC
Adepoju Jeremiah Moyondafoluwa
MedARC
Robert Scholz
MedARC
Bofeng Huang
MedARC
Molly Beavers
MedARC
Srishti Gureja
MedARC
Anish Mahishi
New York University
Sameed Khan
MedARC
Maxime Griot
MedARC
Hunar Batra
University of Oxford
Machine Learning, Language Models, Multimodal AI, Reinforcement Learning, AI Safety
Jean-Benoit Delbrouck
Hugging Face, Stanford
Siddhant Bharadwaj
Project Associate, Indian Institute of Science
Computer Vision
Ronald Clark
University of Oxford
Computer Vision, Robotics, Machine Learning, Optimisation