🤖 AI Summary
To address imprecise vision-language alignment and poor model interpretability in weakly supervised medical visual grounding, this paper identifies two fundamental limitations in vision-language models (VLMs): excessively high norms of background tokens and insufficient local lesion representation capability of global tokens. We propose Disease-Aware Prompting (DAP), a pixel-level annotation-free method that leverages explainability heatmaps to guide feature reweighting, jointly optimizing token-level attention modulation and lesion-region enhancement to achieve fine-grained alignment between textual descriptions and thoracic X-ray lesions. Evaluated on three mainstream chest X-ray datasets, DAP improves visual grounding accuracy by an average of 20.74% over state-of-the-art methods, significantly enhancing clinical trustworthiness and decision transparency.
📝 Abstract
Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting relevant pathological features corresponding to textual descriptions, improving model transparency and trustworthiness for wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current VLMs assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens. This hampers the identification of correlations between text and disease tokens. To address this, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This simple strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74% compared to state-of-the-art methods across three major chest X-ray datasets.
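The core idea of amplifying disease-relevant regions while suppressing background tokens can be sketched as heatmap-guided reweighting of patch-token features. This is a minimal illustrative sketch, not the paper's implementation: the function name `dap_reweight`, the `alpha` parameter, and the specific normalization scheme are assumptions for illustration.

```python
import numpy as np

def dap_reweight(patch_tokens, heatmap, alpha=1.0):
    """Hypothetical sketch of explainability-guided token reweighting.

    patch_tokens: (N, D) array of patch-token features from a VLM image encoder.
    heatmap:      (N,) explainability score per patch (e.g. from an
                  attention- or gradient-based explainability map).
    alpha:        modulation strength (an assumed hyperparameter).
    """
    # Normalize the heatmap to [0, 1] so it acts as a soft lesion mask.
    h = heatmap - heatmap.min()
    h = h / (h.max() + 1e-8)
    # Scale tokens: likely-lesion patches (h near 1) are amplified,
    # background patches (h near 0) are suppressed.
    weights = 1.0 + alpha * (h - 0.5) * 2.0  # range [1 - alpha, 1 + alpha]
    return patch_tokens * weights[:, None]

# Toy usage: four patches, three feature dimensions.
tokens = np.ones((4, 3))
heat = np.array([0.0, 0.2, 0.8, 1.0])
out = dap_reweight(tokens, heat, alpha=0.5)
```

In this toy run the background patch (score 0.0) is scaled down toward 0.5x while the high-scoring patch (1.0) is scaled up toward 1.5x, mirroring the amplify/suppress behavior described in the abstract.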