AI Summary
Existing visual encoders struggle to capture subtle yet clinically significant differences in medical images, limiting the performance of multi-image medical difference visual question answering (VQA). This work proposes the first location-aware pretraining framework tailored for medical difference VQA. By integrating tasks such as Automatic Referring Expressions (AREF), Grounded Captioning (GCAP), and Conditional Automatic Referring Expressions (CAREF), the framework enables the visual encoder to learn fine-grained, spatially localized visual representations and jointly model them with a language model. This approach substantially enhances the model's ability to perceive and reason about lesion locations and their changes. Evaluated on chest X-ray difference VQA tasks, the method achieves state-of-the-art performance.
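To make the three location-aware pretraining objectives more concrete, the sketch below shows one plausible way such tasks could be serialized as prompt/target pairs with quantized bounding-box tokens. The task formats, the `<loc>` token scheme, and the function names are illustrative assumptions, not the paper's actual specification.

```python
# Illustrative sketch (assumptions, not the paper's spec): serializing
# location-aware pretraining tasks as text-to-text examples in which
# normalized boxes are quantized into discrete location tokens.
def box_to_tokens(box, bins=100):
    """Quantize a normalized (x1, y1, x2, y2) box into discrete location tokens."""
    return " ".join(f"<loc{int(round(c * (bins - 1)))}>" for c in box)


def make_pretraining_example(task, finding, box, condition=None):
    """Return a (prompt, target) pair for one of the three assumed task formats."""
    box_str = box_to_tokens(box)
    if task == "AREF":   # region -> referring expression for that region
        return f"describe region {box_str}", finding
    if task == "GCAP":   # image -> caption grounded with box tokens
        return "caption with locations", f"{finding} {box_str}"
    if task == "CAREF":  # condition -> localization of that condition
        return f"where is {condition}?", box_str
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    prompt, target = make_pretraining_example(
        "CAREF",
        "left pleural effusion",
        (0.55, 0.60, 0.95, 0.92),
        condition="pleural effusion",
    )
    print(prompt)  # e.g. "where is pleural effusion?"
    print(target)  # e.g. "<loc54> <loc59> <loc94> <loc91>"
```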
Abstract
Unlike conventional single-image models, medical difference VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained with contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pretraining methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
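As a rough illustration of the second stage, the following sketch shows one way a shared, pretrained vision encoder's patch features for a reference and a current study could be fused and projected into a language model's embedding space as prefix tokens for difference VQA. The module names, dimensions, and fusion scheme (concatenating both feature grids with their difference) are assumptions for illustration; the paper's actual integration may differ.

```python
# Minimal sketch (not the authors' code): fusing patch features from two
# chest X-ray studies into "difference" prefix tokens for a language model.
# Module names, dimensions, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class PatchEncoder(nn.Module):
    """Stand-in vision encoder producing a grid of patch embeddings."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                         # x: (B, 1, H, W)
        feats = self.proj(x)                      # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)   # (B, num_patches, dim)


class DifferenceVQAPrefix(nn.Module):
    """Builds language-model prefix tokens from a (reference, current) image pair."""

    def __init__(self, vis_dim=256, lm_dim=512):
        super().__init__()
        self.encoder = PatchEncoder(dim=vis_dim)    # shared across both images
        self.fuse = nn.Linear(3 * vis_dim, lm_dim)  # [reference, current, current - reference]

    def forward(self, reference, current):
        ref_tokens = self.encoder(reference)
        cur_tokens = self.encoder(current)
        fused = torch.cat([ref_tokens, cur_tokens, cur_tokens - ref_tokens], dim=-1)
        return self.fuse(fused)                     # (B, num_patches, lm_dim)


if __name__ == "__main__":
    model = DifferenceVQAPrefix()
    prev_xray = torch.randn(2, 1, 224, 224)   # earlier study
    curr_xray = torch.randn(2, 1, 224, 224)   # current study
    prefix = model(prev_xray, curr_xray)      # would be prepended to question embeddings
    print(prefix.shape)                       # torch.Size([2, 196, 512])
```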