LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical vision–language pretraining methods struggle to learn reliable diagnostic representations under limited paired data: global alignment is easily dominated by non-diagnostic information, while local alignment fails to cohesively integrate critical evidence. To address this, the work proposes an evidence-level alignment pretraining framework that uses prompt-engineered large language models (LLMs) to extract diagnostic evidence from radiology reports, constructing a cross-modal shared semantic space grounded in clinical reasoning. By combining contrastive learning with strategies for exploiting unpaired data, the approach shifts the alignment objective from holistic image–text matching to clinically coherent evidence-level correspondence. The method substantially reduces reliance on paired data and, on phrase grounding, image–text retrieval, and zero-shot classification, achieves performance comparable to state-of-the-art approaches that depend on extensive paired corpora.
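The summary above does not specify the prompt or the LLM used for evidence extraction. As a minimal, illustrative sketch only, the snippet below shows what prompt-based extraction of diagnostic-evidence phrases from a radiology report could look like; the prompt text, the extract_evidence helper, and the model id are all assumptions (the code assumes an OpenAI-compatible chat API), not the paper's implementation.

```python
# Illustrative sketch: prompting an LLM to pull diagnostic-evidence phrases
# out of a radiology report. Prompt, model, and parsing are assumptions,
# not the paper's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVIDENCE_PROMPT = (
    "You are a radiologist. From the report below, list each distinct piece "
    "of diagnostic evidence (finding, location, severity) as a short phrase, "
    "one per line. Ignore non-diagnostic text.\n\nReport:\n{report}"
)

def extract_evidence(report: str, model: str = "gpt-4o-mini") -> list[str]:
    """Return a list of diagnostic-evidence phrases for one report."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EVIDENCE_PROMPT.format(report=report)}],
        temperature=0.0,  # deterministic extraction
    )
    text = resp.choices[0].message.content or ""
    # one evidence phrase per line; strip common bullet markers
    return [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    report = ("Heart size is mildly enlarged. Patchy opacity in the right "
              "lower lobe, likely pneumonia. No pleural effusion.")
    for phrase in extract_evidence(report):
        print(phrase)
```

Phrases extracted this way can then serve as the text side of the evidence space, one embedding per phrase rather than one per report.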

📝 Abstract
Most existing CLIP-style medical vision–language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image–text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.
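To make evidence-level alignment concrete, here is a hedged sketch of a symmetric InfoNCE loss applied to matched image-side and text-side evidence embeddings instead of whole image–report pairs. The function name evidence_infonce, the tensor shapes, and the temperature are illustrative assumptions; LGDEA's actual loss and evidence-space construction are not given in this abstract.

```python
# Sketch of evidence-level contrastive alignment (symmetric InfoNCE).
# Row i of img_ev and txt_ev are assumed to encode the SAME piece of
# diagnostic evidence (e.g. "right lower lobe opacity").
import torch
import torch.nn.functional as F

def evidence_infonce(img_ev: torch.Tensor,
                     txt_ev: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_ev, txt_ev: (N, D) evidence embeddings from the two encoders."""
    img_ev = F.normalize(img_ev, dim=-1)
    txt_ev = F.normalize(txt_ev, dim=-1)
    logits = img_ev @ txt_ev.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(img_ev.size(0), device=img_ev.device)
    # matching evidence pairs sit on the diagonal; all others act as negatives
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# usage with random features standing in for encoder outputs
print(evidence_infonce(torch.randn(32, 512), torch.randn(32, 512)).item())
```

Because the positives here are evidence phrases rather than full reports, images and reports that were never paired can still contribute to training, as long as evidence can be attached to each side.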
Problem

Research questions and friction points this paper is trying to address.

medical vision-language pretraining
limited paired data
diagnostic evidence alignment
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided alignment
diagnostic evidence
medical vision-language pretraining
evidence-level alignment
limited paired data
Huimin Yan
Institute of Intelligent Information Processing, Shanxi University, Taiyuan, 030006, China
Liang Bai
Institute of Intelligent Information Processing, Shanxi University, Taiyuan, 030006, China
Xian Yang
University of Manchester
Artificial Intelligence · Machine Learning · Healthcare AI · Natural Language Processing
Long Chen
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China