LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical vision–language pretraining methods struggle to learn reliable diagnostic representations under limited paired data: global alignment is easily dominated by non-diagnostic information, while local alignment fails to cohesively integrate critical evidence. To address this, the work proposes an evidence-level alignment pretraining framework that uses prompt-engineered large language models (LLMs) to extract diagnostic evidence from radiology reports, constructing a cross-modal shared semantic space grounded in clinical reasoning. By combining contrastive learning with strategies for exploiting unpaired data, the approach shifts the alignment objective from holistic image–text matching to clinically coherent evidence-level correspondence. The method substantially reduces reliance on paired data and, on phrase grounding, image–text retrieval, and zero-shot classification, achieves performance comparable to state-of-the-art approaches that depend on extensive paired corpora.
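The summary above does not specify the prompt or the LLM used for evidence extraction. As a minimal, illustrative sketch only, the snippet below shows what prompt-based extraction of diagnostic-evidence phrases from a radiology report could look like; the prompt text, the extract_evidence helper, and the model id are all assumptions (the code assumes an OpenAI-compatible chat API), not the paper's implementation.

```python
# Illustrative sketch: prompting an LLM to pull diagnostic-evidence phrases
# out of a radiology report. Prompt, model, and parsing are assumptions,
# not the paper's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVIDENCE_PROMPT = (
    "You are a radiologist. From the report below, list each distinct piece "
    "of diagnostic evidence (finding, location, severity) as a short phrase, "
    "one per line. Ignore non-diagnostic text.\n\nReport:\n{report}"
)

def extract_evidence(report: str, model: str = "gpt-4o-mini") -> list[str]:
    """Return a list of diagnostic-evidence phrases for one report."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EVIDENCE_PROMPT.format(report=report)}],
        temperature=0.0,  # deterministic extraction
    )
    text = resp.choices[0].message.content or ""
    # one evidence phrase per line; strip common bullet markers
    return [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    report = ("Heart size is mildly enlarged. Patchy opacity in the right "
              "lower lobe, likely pneumonia. No pleural effusion.")
    for phrase in extract_evidence(report):
        print(phrase)
```

Phrases extracted this way can then serve as the text side of the evidence space, one embedding per phrase rather than one per report.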

📝 Abstract
Most existing CLIP-style medical vision–language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image–text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.
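To make evidence-level alignment concrete, here is a hedged sketch of a symmetric InfoNCE loss applied to matched image-side and text-side evidence embeddings instead of whole image–report pairs. The function name evidence_infonce, the tensor shapes, and the temperature are illustrative assumptions; LGDEA's actual loss and evidence-space construction are not given in this abstract.

```python
# Sketch of evidence-level contrastive alignment (symmetric InfoNCE).
# Row i of img_ev and txt_ev are assumed to encode the SAME piece of
# diagnostic evidence (e.g. "right lower lobe opacity").
import torch
import torch.nn.functional as F

def evidence_infonce(img_ev: torch.Tensor,
                     txt_ev: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_ev, txt_ev: (N, D) evidence embeddings from the two encoders."""
    img_ev = F.normalize(img_ev, dim=-1)
    txt_ev = F.normalize(txt_ev, dim=-1)
    logits = img_ev @ txt_ev.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(img_ev.size(0), device=img_ev.device)
    # matching evidence pairs sit on the diagonal; all others act as negatives
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# usage with random features standing in for encoder outputs
print(evidence_infonce(torch.randn(32, 512), torch.randn(32, 512)).item())
```

Because the positives here are evidence phrases rather than full reports, images and reports that were never paired can still contribute to training, as long as evidence can be attached to each side.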
Problem

Research questions and friction points this paper is trying to address.

medical vision-language pretraining
limited paired data
diagnostic evidence alignment
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided alignment
diagnostic evidence
medical vision-language pretraining
evidence-level alignment
limited paired data
Huimin Yan
Institute of Intelligent Information Processing, Shanxi University, Taiyuan, 030006, China
Liang Bai
Institute of Intelligent Information Processing, Shanxi University, Taiyuan, 030006, China
Xian Yang
University of Manchester
Artificial Intelligence · Machine Learning · Healthcare AI · Natural Language Processing
Long Chen
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China