🤖 AI Summary
To address the scarcity of high-quality annotated data and high annotation costs in radiology vision-language contrastive pretraining, this paper proposes a zero-shot diagnostic label extraction method leveraging large language models (LLMs), which automatically generates high-precision “silver-standard” image-label pairs from radiology reports without complex prompt engineering. The approach integrates a 3D ResNet-18 encoder with the CLIP framework and performs contrastive learning on a large-scale, self-collected CT image–report paired dataset. Experiments demonstrate state-of-the-art zero-shot diagnostic performance: 83.8% and 77.3% AUC on CT-RATE and RAD-ChestCT, respectively; image–image retrieval achieves 53.7% mAP@50 and report–image retrieval 52.2% Recall@100, substantially outperforming existing baselines. This work is the first to systematically introduce an LLM-driven automated label construction pipeline into medical vision-language pretraining, achieving both improved performance and scalability while significantly lowering domain adaptation barriers.
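The label-construction step described above amounts to prompting an LLM per report and parsing its yes/no answers into binary labels. The sketch below is a minimal illustration of that parsing stage; the prompt wording, the pathology list, and the `call_llm` stub are assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative sketch of LLM-based "silver-standard" label extraction.
# The pathology subset and prompt template are hypothetical examples;
# call_llm is a placeholder for any chat-completion API.

PATHOLOGIES = ["Atelectasis", "Cardiomegaly", "Pleural effusion"]  # illustrative subset

PROMPT_TEMPLATE = (
    "Read the CT radiology report below. For each finding, answer on its own "
    "line in the form 'Finding: yes' or 'Finding: no'.\n"
    "Findings: {findings}\n\nReport:\n{report}"
)

def build_prompt(report: str) -> str:
    """Format the zero-shot prompt for one report (no few-shot examples needed)."""
    return PROMPT_TEMPLATE.format(findings=", ".join(PATHOLOGIES), report=report)

def parse_labels(llm_response: str, pathologies=PATHOLOGIES) -> dict:
    """Turn the LLM's line-per-finding answers into a binary label dict."""
    labels = {p: 0 for p in pathologies}
    for line in llm_response.splitlines():
        name, sep, answer = line.partition(":")
        name = name.strip()
        if sep and name in labels and answer.strip().lower().startswith("yes"):
            labels[name] = 1
    return labels

def extract_labels(report: str, call_llm) -> dict:
    """call_llm: any function str -> str wrapping an LLM chat API (placeholder)."""
    return parse_labels(call_llm(build_prompt(report)))
```

Because the answers are constrained to one line per finding, parsing stays trivial and the per-report cost is a single short completion, which is what makes labeling tens of thousands of reports for a few dollars plausible.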
📝 Abstract
The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this "silver-standard" dataset achieves performance comparable to encoders trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (mAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
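The "vanilla CLIP training" mentioned above is the standard symmetric InfoNCE objective: image and report embeddings are compared in a similarity matrix whose diagonal entries are the matched pairs, and cross-entropy is applied along both rows and columns. A minimal dependency-free sketch of that loss (the function names and the unscaled similarity input are my own simplifications, omitting the learnable temperature):

```python
import math

def _softmax_xent(row, target):
    """Cross-entropy of a softmax over one row against the target index."""
    m = max(row)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    return -math.log(exps[target] / sum(exps))

def clip_loss(sim):
    """Symmetric InfoNCE over an NxN image-report similarity matrix.

    sim[i][j] = similarity(image_i, report_j); matched pairs sit on the diagonal.
    Averages image->report (rows) and report->image (columns) cross-entropy.
    """
    n = len(sim)
    img_to_txt = sum(_softmax_xent(sim[i], i) for i in range(n)) / n
    txt_to_img = sum(_softmax_xent([sim[j][i] for j in range(n)], i)
                     for i in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)
```

With uninformative (all-equal) similarities the loss is log N, and it approaches zero as the diagonal dominates; the paper's point is that a supervised pre-trained encoder gives this objective a much better starting point than training from scratch.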