More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity of high-quality annotated data and the high cost of annotation in radiology vision-language contrastive pretraining, this paper proposes a zero-shot diagnostic label extraction method leveraging large language models (LLMs), which automatically generates high-precision “silver-standard” image-label pairs from radiology reports without complex prompt engineering. The approach integrates a 3D ResNet-18 encoder with the CLIP framework and performs contrastive learning on a large-scale, self-collected CT image–report paired dataset. Experiments demonstrate state-of-the-art zero-shot diagnostic performance: 83.8% and 77.3% AUC on CT-RATE and RAD-ChestCT, respectively; image–image retrieval achieves 53.7% mAP@50, substantially outperforming existing baselines. This work is the first to systematically introduce an LLM-driven automated label-construction pipeline into medical vision-language pretraining, improving both performance and scalability while significantly lowering domain adaptation barriers.
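The label-extraction step described above can be sketched as a zero-shot prompt plus JSON parsing. This is an illustrative assumption, not the paper's actual pipeline: the finding names, prompt wording, and JSON schema are hypothetical, and the LLM call is stubbed with a canned reply.

```python
import json

# Hypothetical finding list; the paper's actual label set is not given here.
FINDINGS = ["atelectasis", "pleural effusion", "lung nodule"]

def build_prompt(report: str) -> str:
    """Zero-shot prompt asking an LLM for binary labels as JSON (illustrative wording)."""
    return (
        "For each finding below, answer 1 if the radiology report indicates "
        "its presence and 0 otherwise. Reply with JSON only.\n"
        f"Findings: {', '.join(FINDINGS)}\n"
        f"Report: {report}"
    )

def parse_labels(llm_reply: str) -> dict:
    """Parse the LLM's JSON reply into a {finding: 0/1} 'silver-standard' label dict."""
    labels = json.loads(llm_reply)
    return {k: int(labels.get(k, 0)) for k in FINDINGS}

# Stubbed LLM reply for illustration; a real pipeline would call an LLM API
# with build_prompt(report) and feed the parsed labels to supervised pre-training.
reply = '{"atelectasis": 0, "pleural effusion": 1, "lung nodule": 0}'
silver_labels = parse_labels(reply)
```

Because the reply is constrained to a fixed JSON schema, the same parser works across reports, which is what makes labeling at the scale of ~50k reports cheap.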

📝 Abstract
The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this "silver-standard" dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (mAP@50 = 53.7% for image-image, Recall@100 = 52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
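The "vanilla CLIP training" the abstract refers to optimizes a symmetric contrastive (InfoNCE) objective over paired image and report embeddings. A minimal NumPy sketch, assuming L2-normalized embeddings and an illustrative temperature (the paper's actual hyperparameters are not given here):

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss of vanilla CLIP.

    img_emb, txt_emb: (N, D) arrays where row i of each comes from the same study,
    so matching pairs sit on the diagonal of the similarity matrix.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N) scaled similarity matrix
    labels = np.arange(len(img))            # positive pair index for each row

    def xent(l: np.ndarray) -> float:
        """Mean cross-entropy of each row against its diagonal target."""
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

When each image embedding is closest to its own report, the loss approaches zero; mismatched pairings drive it up, which is the signal that aligns the two modalities.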
Problem

Research questions and friction points this paper is trying to address.

Automating diagnostic label extraction from radiology reports
Enabling cost-effective large-scale medical vision-language pre-training
Improving contrastive vision-language alignment for medical AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs to extract diagnostic labels from reports
Creating low-cost silver-standard datasets for pre-training
Achieving state-of-the-art performance with simple CLIP training
Yingtai Li
University of Science & Technology of China
Haoran Lai
University of Science and Technology of China
Medical Image Processing · Deep Learning
Xiaoqian Zhou
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China (USTC), Hefei Anhui, 230026, China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advance Research, USTC, Suzhou Jiangsu, 215123, China
Shuai Ming
The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC, Hefei Anhui, 230001, China
Wenxin Ma
University of Science and Technology of China
AI · Computer Vision
Wei Wei
The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC, Hefei Anhui, 230001, China
Shaohua Kevin Zhou
Professor, USTC, FAIMBE, FIAMBE, FIEEE, FMICCAI, FNAI
Medical Image Computing · Computer Vision & Pattern Recognition · Machine & Deep Learning