Visual Prompt Engineering for Vision Language Models in Radiology

📅 2024-08-28
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Medical image zero-shot classification faces challenges including global feature dominance, weak lesion localization, and insufficient interpretability—particularly critical in radiology, where modeling localized pathological regions is essential. To address this, we introduce, for the first time, a systematic visual prompting framework—comprising arrows, bounding boxes, and circles—into zero-shot chest X-ray classification. These prompts are directly embedded into images to guide a CLIP-based vision-language model toward clinically salient regions, establishing a radiology-oriented local-attention guidance paradigm. Evaluated on four public benchmarks, our method achieves up to a 0.185 improvement in AUROC. Attention heatmaps quantitatively confirm enhanced focus on pathological regions, demonstrating simultaneous gains in diagnostic accuracy and clinical interpretability.

Technology Category

Application Category

📝 Abstract
Medical image classification plays a crucial role in clinical decision-making, yet most models are constrained to a fixed set of predefined classes, limiting their adaptability to new conditions. Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining. However, while CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy. To address this, we explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers $unicode{x2013}$ such as arrows, bounding boxes, and circles $unicode{x2013}$ directly into radiological images to guide model attention. Evaluating across four public chest X-ray datasets, we demonstrate that visual markers improve AUROC by up to 0.185, highlighting their effectiveness in enhancing classification performance. Furthermore, attention map analysis confirms that visual cues help models focus on clinically relevant areas, leading to more interpretable predictions. To support further research, we use public datasets and will release our code and preprocessing pipeline, providing a reference point for future work on localized classification in medical imaging.
Problem

Research questions and friction points this paper is trying to address.

Enhancing radiology image classification
Localizing pathology for better accuracy
Improving interpretability with visual cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual markers enhance classification accuracy
Localized focus improves diagnostic interpretability
Public datasets and code support future research
🔎 Similar Papers
No similar papers found.