EVLF-FM: Explainable Vision Language Foundation Model for Medicine

📅 2025-09-28
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Current medical AI foundation models are largely modality-specific and lack transparent reasoning, which hinders clinical deployment. To address this, we propose EVLF-FM, an early medical vision-language foundation model that combines fine-grained explainability with multi-disease diagnosis. It covers eleven imaging modalities across six clinical specialties and supports multi-disease classification, visual question answering, and pixel-level visual grounding. The model is trained with a hybrid strategy that combines supervised learning with visual reinforcement fine-tuning, so that its step-by-step reasoning is aligned with visual evidence. Internal validation yields an average accuracy of 0.858 and F1-score of 0.797 for diagnosis, and an average mIoU of 0.743 and Acc@0.5 of 0.837 for visual grounding. External validation shows strong zero-shot and few-shot generalization, which could improve clinical trust and practical utility.
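To make the diagnostic metrics concrete, here is a minimal sketch of how accuracy and F1-score can be computed from predicted and reference labels. It assumes macro-averaged F1 (the summary does not say which averaging the authors used), and the function names and toy labels are illustrative, not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the reference."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, purely illustrative.
y_true = ["normal", "normal", "lesion", "lesion"]
y_pred = ["normal", "lesion", "lesion", "lesion"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Macro averaging weights every class equally, which matters when some diseases are rare; a different averaging choice would give different numbers on the same predictions.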

📝 Abstract
Despite the promise of foundation models in medical AI, current systems remain limited: they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grained explainability. The development and testing of EVLF-FM encompassed over 1.3 million samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multi-disease diagnosis and visual question answering, with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also performed strongly across nine modalities, with an average mIoU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.
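The grounding metrics quoted above can be illustrated with a short sketch. It assumes each prediction and reference is an axis-aligned box `(x1, y1, x2, y2)`, that mIoU is the mean IoU over samples, and that Acc@0.5 is the fraction of samples with IoU of at least 0.5. These are the common definitions, but the abstract does not spell them out, and the helper names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, targets, threshold=0.5):
    """Return (mean IoU, Acc@threshold) over paired predictions and targets."""
    ious = [box_iou(p, t) for p, t in zip(preds, targets)]
    mean_iou = sum(ious) / len(ious)
    acc = sum(iou >= threshold for iou in ious) / len(ious)
    return mean_iou, acc


# Toy boxes, purely illustrative: one exact match, one half-shifted box.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(grounding_metrics(preds, targets))
```

The two numbers answer different questions: mIoU rewards tighter overlap on every sample, while Acc@0.5 only counts whether each localization clears the threshold.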
Problem

Research questions and friction points this paper is trying to address.

Addressing modality-specific limitations in medical AI foundation models
Providing transparent reasoning processes for clinical adoption
Unifying diagnostic capability with fine-grained explainability across specialties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal vision-language model for medical diagnosis
Pixel-level visual grounding and reasoning capabilities
Hybrid training strategy combining supervised and reinforcement fine-tuning
Authors

Yang Bai (1)
Haoran Cheng: Zhejiang University [Deep Learning, Computer Vision]
Yang Zhou (1)
Jun Zhou (1)
Arun Thirunavukarasu (4,5)
Yuhe Ke (6)
Jie Yao (2,3)
Kanae Fukutsu (2,3)
Chrystie Wan Ning Quek (3)
Ashley Hong (3)
Laura Gutierrez (3)
Zhen Ling Teo (2,3)
Darren Shu Jeng Ting (2,7,8,9,10)
Brian T. Soetikno (10)
Christopher S. Nielsen (11)
Tobias Elze: Schepens Eye Research Institute, Harvard Medical School [ophthalmology, machine learning]
Zengxiang Li (3,13)
Linh Le Dinh (1)
Hiok Hong Chan (3)
Victor Koh (14,15)
Marcus Tan (14,15)
Kelvin Z. Li (16,17)
Leonard Yip (16,17)
Ching Yu Cheng (3,15)
Yih Chung Tham: Yong Loo Lin School of Medicine, National University of Singapore; Singapore Eye Research Institute [Ophthalmology, Epidemiology, Visual Impairment, Deep Learning]