AI Summary
Medical language model evaluation faces two bottlenecks: heavy reliance on costly expert annotation and the absence of ground-truth reference answers. To address this, we propose MedVAL, the first self-supervised verification framework targeting clinical factuality and safety. Our method eliminates the need for physician annotations or reference texts by leveraging synthetically generated data and multi-task fine-tuning to train compact language models (e.g., MedVAL-4B) for automated clinical error detection. We further introduce MedVAL-Bench, a physician-curated benchmark built on a fine-grained, risk-tiered error taxonomy, enhancing clinical relevance. Evaluated across six medical tasks and ten state-of-the-art models, MedVAL achieves an average F1 score of 83% (+17 percentage points), significantly improving alignment with expert judgments; notably, it improves even GPT-4o's performance by 8%. The code, datasets, and models are publicly released.
Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
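To make the headline metric concrete, the sketch below shows one plausible way to score an evaluator LM's per-sample safety classifications against physician annotations as a binary F1, treating "unsafe" as the positive class. This is an illustrative reading of the reported alignment metric, not MedVAL's actual scoring code; the label names and example data are assumptions.

```python
# Hedged sketch: evaluator-vs-physician agreement as a binary F1 score.
# "safe"/"unsafe" labels and the toy data below are illustrative assumptions,
# not taken from MedVAL-Bench itself.

def safety_f1(physician_labels, evaluator_labels, positive="unsafe"):
    """F1 of the evaluator's safety labels against physician labels."""
    pairs = list(zip(physician_labels, evaluator_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: the evaluator misses one physician-flagged unsafe output.
physician = ["safe", "unsafe", "unsafe", "safe", "unsafe"]
evaluator = ["safe", "unsafe", "safe", "safe", "unsafe"]
print(round(safety_f1(physician, evaluator), 2))  # 0.8
```

Averaging this score over tasks and models would yield an aggregate alignment figure of the kind the abstract reports (66% before vs. 83% after MedVAL fine-tuning).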