Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses three limitations in evaluating medical large language models (LLMs): benchmark saturation, restricted data access, and insufficient task coverage. It introduces Medmarks, a fully open-source evaluation suite of 30 benchmarks spanning question answering, information extraction, medical calculation, and open-ended clinical reasoning. The authors evaluate 61 models across 71 configurations using verifiable metrics together with LLM-as-a-Judge; a subset of the evaluations (Medmarks-T) can also serve as reinforcement learning environments for post-training LLMs on medical reasoning. Results show that frontier reasoning models achieve the strongest performance, medically fine-tuned models outperform their generalist counterparts, most frontier proprietary models are significantly more token efficient than open-weight alternatives, and models are susceptible to answer-order bias, particularly smaller models and Grok 4.

📝 Abstract

Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites are either saturated, depend heavily on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the highest performance across both benchmarks, that most frontier proprietary models are significantly more token efficient than open-weight alternatives, that medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks.
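The abstract does not say how answer-order bias was measured, so the following is only a minimal sketch of one common way to probe it: compare a model's correctness on the original option order with its accuracy over every reordering of the same options. The `Model` interface, the function names, and the toy models are all hypothetical and are not taken from the Medmarks code.

```python
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not the Medmarks API): a model takes
# a question and an ordered list of options and returns the chosen index.
Model = Callable[[str, Sequence[str]], int]


def permuted_accuracy(model: Model, question: str,
                      options: Sequence[str], correct_idx: int) -> float:
    """Accuracy of `model` averaged over every ordering of `options`."""
    orders = list(permutations(range(len(options))))
    hits = 0
    for order in orders:
        reordered = [options[i] for i in order]
        # Position of the correct option after reordering.
        new_correct = order.index(correct_idx)
        if model(question, reordered) == new_correct:
            hits += 1
    return hits / len(orders)


def order_bias_gap(model: Model, question: str,
                   options: Sequence[str], correct_idx: int) -> float:
    """Correctness on the original order minus accuracy over all orderings.

    A gap near 0 suggests the answer does not depend on option position.
    """
    original = 1.0 if model(question, options) == correct_idx else 0.0
    return original - permuted_accuracy(model, question, options, correct_idx)


# Toy stand-ins for real models, for illustration only.
def always_first(question: str, options: Sequence[str]) -> int:
    return 0  # purely positional: always picks the first option


def content_based(question: str, options: Sequence[str]) -> int:
    return list(options).index("Metformin")  # picks by content, not position


QUESTION = "Which drug is a common first-line therapy for type 2 diabetes?"
OPTIONS = ["Metformin", "Insulin", "Warfarin", "Aspirin"]

print(order_bias_gap(always_first, QUESTION, OPTIONS, 0))
print(order_bias_gap(content_based, QUESTION, OPTIONS, 0))
```

Enumerating all orderings keeps the toy example deterministic. With many options or many questions, a fixed-seed random sample of orderings would be the more practical choice.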

Problem

Research questions and friction points this paper is trying to address.

LLM benchmark

medical tasks

benchmark saturation

data accessibility

task coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-source benchmark

medical LLM evaluation

LLM-as-a-Judge