Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

📅 2026-05-15

🤖 AI Summary

This work addresses the lack of auditability and transparency in current large language model (LLM)-based clinical decision support systems: most "open" models release only weights and withhold data provenance, curation procedures, and generation pipelines. The authors propose Fully Open Meditron (MeditronFO), which they describe as the first fully open, end-to-end pipeline for building and evaluating clinical LLMs. Its corpus unifies eight public medical question-answering datasets and adds three types of clinician-vetted synthetic data: exam-style QA, guideline-grounded QA, and clinical vignettes. Safeguards include system-wide decontamination, gold-label resampling of teacher generations, end-to-end validation by a four-physician panel, and an LLM-as-a-judge evaluation protocol over expert-written clinical vignettes. Applied to five fully open base models, all MeditronFO variants are preferred over their bases; Apertus-70B-MeditronFO rises from 47.2% to 53.8% on aggregate medical benchmarks, a new state of the art among fully open models. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and scores higher on HealthBench (58% vs 55.9%).

📝 Abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.
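The abstract mentions system-wide decontamination but does not say how it is implemented. One common approach is to drop any training example that shares a long n-gram with a benchmark item. The sketch below illustrates that idea only; the function names, the tokenizer, and the 8-token window are assumptions for illustration, not details taken from the paper.

```python
import re
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase alphanumeric token n-grams in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Texts shorter than n tokens yield an empty set and are never flagged.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train: Iterable[str], benchmark: Iterable[str], n: int = 8) -> List[str]:
    """Keep only training examples that share no n-gram with any benchmark item.

    Hypothetical helper: the paper's actual matching rule is not given here.
    """
    banned: Set[Tuple[str, ...]] = set()
    for item in benchmark:
        banned |= ngrams(item, n)
    return [example for example in train if ngrams(example, n).isdisjoint(banned)]
```

A smaller window removes more near-duplicates at the cost of discarding unrelated examples that happen to share common clinical phrasing, so the threshold would need to be tuned and audited in practice.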

Problem

Research questions and friction points this paper is trying to address.

clinical decision support systems

auditable pipeline

open-weight models

data provenance

reproducible validation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully Open Pipeline

Clinician-Audited Corpus

Decontamination

LLM-as-a-Judge Evaluation

Reproducible Training Framework
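
For the LLM-as-a-Judge Evaluation item above, the abstract reports results as pairwise preference rates (for example, 58.6% over MedGemma). A minimal sketch of how such a rate could be computed follows. The `judge` callable is hypothetical and stands in for whatever judge model and rubric the authors use; querying each pair in both orders is a common way to reduce position bias and is an assumption here, not a documented step of the paper's protocol.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (vignette, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def preference_rate(cases: Iterable[Tuple[str, str, str]], judge: Judge) -> float:
    """Fraction of judge votes that favor the candidate over the baseline.

    Each case is (vignette, candidate_answer, baseline_answer).
    """
    wins = 0
    total = 0
    for vignette, candidate, baseline in cases:
        # Candidate shown first: a vote for "A" is a win.
        if judge(vignette, candidate, baseline) == "A":
            wins += 1
        # Candidate shown second: a vote for "B" is a win.
        if judge(vignette, baseline, candidate) == "B":
            wins += 1
        total += 2
    if total == 0:
        raise ValueError("no cases to evaluate")
    return wins / total
```

A rate above 0.5 means the judge favors the candidate more often than the baseline; how far above 0.5 counts as meaningful depends on the number of vignettes and on how well the judge agrees with human raters.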
