Distribution-Based Masked Medical Vision-Language Model Using Structured Reports

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical image-language pretraining models struggle to capture the variability and diagnostic uncertainty inherent in clinical data, which limits their generalizability. To address this, the authors propose Distribution-based Masked Medical Vision-Language Pretraining (D-MedVLP), which they describe as the first multimodal pretraining framework to explicitly incorporate uncertainty-aware learning. D-MedVLP uses a large language model to automatically generate structured chest X-ray reports (disease definitions, imaging findings, observations, and conclusions) and builds intra- and inter-modal uncertainty-distribution modeling objectives grounded in those reports. Jointly optimizing these objectives with masked image-text reconstruction lets the framework learn clinical semantics and image ambiguity together. Evaluated on five downstream tasks, D-MedVLP achieves state-of-the-art performance, with notable gains in robustness, interpretability, and clinical applicability.
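The paper itself is only summarized above, so the following is a minimal sketch of what a joint objective of this shape (masked-patch reconstruction, masked report modeling, plus an inter-modal distribution-alignment term) could look like in PyTorch. Every name here (`kl_gaussians`, `pretraining_loss`, `w_dist`) and the choice of a symmetrized KL between diagonal Gaussians are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kl_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for diagonal Gaussians,
    summed over the embedding dimension."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    return 0.5 * ((logvar_q - logvar_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0).sum(-1)

def pretraining_loss(img_recon, img_target,    # masked-patch reconstruction
                     txt_logits, txt_target,   # masked-token prediction
                     mu_img, logvar_img,       # image embedding distribution
                     mu_txt, logvar_txt,       # text embedding distribution
                     w_dist=0.1):              # assumed weighting, not from the paper
    # Masked image reconstruction: pixel regression on masked patches.
    l_img = F.mse_loss(img_recon, img_target)
    # Masked language modeling over report tokens: (B, T, V) logits vs (B, T) targets.
    l_txt = F.cross_entropy(txt_logits.flatten(0, 1), txt_target.flatten())
    # Inter-modal term: pull each paired image/text distribution together,
    # using a symmetrized KL as one plausible distribution distance.
    l_inter = 0.5 * (kl_gaussians(mu_img, logvar_img, mu_txt, logvar_txt)
                     + kl_gaussians(mu_txt, logvar_txt, mu_img, logvar_img)).mean()
    return l_img + l_txt + w_dist * l_inter
```

The symmetrized KL is just one reasonable distance between embedding distributions; the paper's actual distribution-modeling objectives may use a different formulation.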

📝 Abstract
Medical image-language pre-training aims to align medical images with clinically relevant text to improve model performance on various downstream tasks. However, existing models often struggle with the variability and ambiguity inherent in medical data, limiting their ability to capture nuanced clinical information and uncertainty. This work introduces an uncertainty-aware medical image-text pre-training model that enhances generalization capabilities in medical image analysis. Building on previous methods and focusing on chest X-rays, our approach utilizes structured text reports generated by a large language model (LLM) to augment image data with clinically relevant context. These reports begin with a definition of the disease, followed by the "appearance" section to highlight critical regions of interest, and finally "observations" and "verdicts" that ground model predictions in clinical semantics. By modeling both inter- and intra-modal uncertainty, our framework captures the inherent ambiguity in medical images and text, yielding improved representations and performance on downstream tasks. Our model demonstrates significant advances in medical image-text pre-training, obtaining state-of-the-art performance on multiple downstream tasks.
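The abstract fixes the report structure (definition, appearance, observations, verdict) but not how the LLM is prompted. A hypothetical prompt template along those lines might look like the sketch below; the wording, the `REPORT_PROMPT` constant, and the `build_prompt` helper are assumptions for illustration, not the authors' prompt.

```python
# Hypothetical prompt template reflecting the four-part report structure
# described in the abstract: definition -> appearance -> observations -> verdict.
REPORT_PROMPT = """You are a radiology assistant. For the disease "{disease}"
on a chest X-ray, produce a structured report with exactly these sections:

Definition: one sentence defining {disease}.
Appearance: the image regions and patterns where {disease} typically manifests.
Observations: findings in this study that support or contradict {disease}.
Verdict: a concluding impression grounded in the observations.
"""

def build_prompt(disease: str) -> str:
    """Fill the template for a single disease label."""
    return REPORT_PROMPT.format(disease=disease)

if __name__ == "__main__":
    print(build_prompt("pneumothorax"))
```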
Problem

Research questions and friction points this paper is trying to address.

Enhances generalization in medical image-text pre-training
Addresses variability and ambiguity in medical data
Improves clinical information capture with structured reports
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty-aware medical image-text pre-training model
LLM-generated structured reports for clinical context
Inter- and intra-modal uncertainty modeling framework (a minimal sketch follows this list)
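As a sketch of what intra-modal uncertainty modeling can mean in practice: represent each image or report as a diagonal Gaussian in embedding space rather than a single point vector. The `DistributionHead` module below is a hypothetical construction under that assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DistributionHead(nn.Module):
    """Maps a deterministic encoder feature to a diagonal Gaussian
    (mean + log-variance), so each sample carries its own uncertainty
    instead of being a point embedding."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, embed_dim)
        self.logvar = nn.Linear(in_dim, embed_dim)

    def forward(self, feat: torch.Tensor):
        mu, logvar = self.mu(feat), self.logvar(feat)
        # Reparameterization trick: sample an embedding while keeping
        # gradients flowing back to mu and logvar.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

# Usage: attach one head per modality on top of the respective encoder.
head = DistributionHead(in_dim=768, embed_dim=256)
z, mu, logvar = head(torch.randn(4, 768))  # batch of 4 encoder features
```

Larger predicted variances then mark ambiguous images or hedged report language, which is one standard way to express the uncertainty the bullet refers to.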
👥 Authors
Shreyank N Gowda
Assistant Professor at the University of Nottingham
Computer Vision, Zero-shot Learning, Green AI

Ruichi Zhang
Department of Computer Science and Technology, School of Informatics, Xiamen University, Xiamen, 361005, China

Xiao Gu
University of Oxford
AI for Healthcare, Biomedical Signal Processing, Wearable/Ambient Intelligence, Deep Learning

Ying Weng
School of Computer Science, University of Nottingham Ningbo China, Ningbo, 315100, China

Lu Yang
Department of Computer Science and Technology, School of Informatics, Xiamen University, Xiamen, 361005, China