Expert-level validation of AI-generated medical text with scalable language models

πŸ“… 2025-07-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Medical language model evaluation faces two bottlenecks: heavy reliance on costly expert annotation and the absence of ground-truth reference answers. To address this, the authors propose MedVAL, a self-supervised verification framework targeting clinical factuality and safety. The method removes the need for physician annotations or reference texts by using synthetically generated data and multi-task fine-tuning to train compact evaluator language models (e.g., MedVAL-4B) for automated clinical error detection. The paper also introduces MedVAL-Bench, a benchmark of 840 physician-annotated outputs organized by a physician-defined taxonomy of risk levels and error categories. Evaluated across six medical tasks and ten state-of-the-art models, MedVAL fine-tuning raises average F1 agreement with physician judgments from 66% to 83% (+17 percentage points) and improves even the best-performing proprietary model, GPT-4o, by 8%. The code, dataset, and models are publicly released.

πŸ“ Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase ( https://github.com/StanfordMIMI/MedVAL ), 2) MedVAL-Bench ( https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench ), and 3) MedVAL-4B ( https://huggingface.co/stanfordmimi/MedVAL-4B ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
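The reported alignment numbers are F1 scores measuring agreement between the evaluator LM's factual-consistency verdicts and physician labels. As a minimal illustration of how such an agreement score could be computed (the labels below are invented placeholders, not data from the paper, and this is not the paper's evaluation code):

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, lab) for lab in labels) / len(labels)

# Hypothetical labels: 1 = factually consistent, 0 = inconsistent
physician = [1, 1, 0, 0, 1, 0, 1, 1]
evaluator = [1, 0, 0, 0, 1, 1, 1, 1]
print(round(macro_f1(physician, evaluator), 2))  # β†’ 0.73
```

Macro-averaging weights each class equally, which matters here because unsafe outputs are typically the minority class.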
Problem

Research questions and friction points this paper is trying to address.

Evaluating accuracy and safety of AI-generated medical text
Reducing reliance on costly manual physician reviews
Detecting clinically significant errors without expert references
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised framework for medical text validation
Synthetic data trains evaluator language models
Open-source tools for scalable clinical integration
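The self-supervised idea in the bullets above, training an evaluator on synthetically corrupted outputs rather than physician labels, can be sketched roughly as follows. All function names and perturbations here are hypothetical illustrations; MedVAL's actual error-injection pipeline lives in the linked codebase.

```python
import random

# Hypothetical error injectors; real clinical perturbations would be richer.
def negate_finding(text):
    return text.replace("no evidence of", "evidence of")

def swap_laterality(text):
    return text.replace("left", "right")

ERROR_INJECTORS = [negate_finding, swap_laterality]

def make_training_pair(source, faithful_summary, rng=random):
    """Build one labeled pair from a single document with no reference
    answer: the faithful summary is labeled consistent, and a
    synthetically corrupted copy is labeled inconsistent."""
    corrupted = rng.choice(ERROR_INJECTORS)(faithful_summary)
    return [
        {"input": source, "output": faithful_summary, "label": "consistent"},
        {"input": source, "output": corrupted, "label": "inconsistent"},
    ]

report = "CT chest: no acute findings in the left lower lobe; no evidence of pulmonary embolism."
summary = "no acute findings in the left lower lobe; no evidence of pulmonary embolism."
pairs = make_training_pair(report, summary)
```

An evaluator LM fine-tuned on many such pairs learns to flag factual inconsistencies without ever seeing a physician annotation, which is what lets the approach scale.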
πŸ‘₯ Authors

Asad Aali
Stanford University

Vasiliki Bikia
Stanford University

Maya Varma
Stanford University
Computer Science

Nicole Chiou
Ph.D. Student, Stanford University
Causality, Distribution Shifts, Healthcare AI, Responsible AI

Sophie Ostmeier
Stanford University
ML, Medicine

Arnav Singhvi
Stanford University

Magdalini Paschali
Postdoctoral Scholar, Stanford University
Deep Learning, Computer Vision, Medical Imaging

Ashwin Kumar
Washington University in St Louis
Reinforcement Learning, Resource Allocation, Fairness, Ride-sharing, Explainable AI Planning

Andrew Johnston
Stanford University

Karimar Amador-Martinez
Stanford University

Eduardo Juan Perez Guerrero
Stanford University

Paola Naovi Cruz Rivera
Stanford University

Sergios Gatidis
Stanford Medicine
Healthcare AI, Medical Image and Data Analysis, Pediatric Radiology, Hybrid Imaging

Christian Bluethgen
Radiologist, Clinician Scientist, USZ Zurich, AIMI Center, Stanford University
Radiology, Thoracic Imaging, Multimodal Machine Learning

Eduardo Pontes Reis
Stanford University / Hospital Israelita Albert Einstein
Machine Learning in Healthcare

Eddy D. Zandee van Rilland
Stanford University

Poonam Laxmappa Hosamani
Stanford University

Kevin R Keet
Stanford University

Minjoung Go
Stanford University

Evelyn Ling
Stanford University

David B. Larson
Stanford University

Curtis Langlotz
Stanford University

Roxana Daneshjou
Stanford University

Jason Hom
Stanford University

Sanmi Koyejo
Assistant Professor, Stanford University
Machine Learning, Healthcare AI, Neuroinformatics