Describe Anything in Medical Images

📅 2025-05-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity of supervised data and insufficient clinical factuality in region-level fine-grained captioning of medical images, this paper proposes MedDAM, the first region-aware medical image captioning framework. Methodologically, it introduces (1) a ground-truth-free, attribute-verification benchmark integrating multimodal expert prompts, domain-specific QA templates, and standardized preprocessing; and (2) a large vision-language model (LVLM)-based generation pipeline, built on GPT-4o and Qwen2.5-VL, enhanced with modality-specific, expert-crafted prompts and an attribute-level clinical factuality verification mechanism. Evaluated on VinDr-CXR, LIDC-IDRI, and SkinCon, MedDAM significantly outperforms strong baselines including GPT-4o and Claude 3.7 Sonnet. The results show that precise region-semantic alignment is critical for clinical interpretability and factual accuracy in medical image description.
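The attribute-level verification idea in the summary can be illustrated with a minimal sketch: a generated region caption is checked against a set of clinically relevant attributes drawn from templated QA, and the factuality score is the fraction of attributes the caption gets right. All names, templates, and the substring-matching check below are illustrative assumptions, not the paper's actual code (a real system would pose the templated question to an LVLM or expert verifier).

```python
def verify_attributes(caption: str, attributes: dict[str, str]) -> tuple[float, dict[str, bool]]:
    """Score a region caption by per-attribute agreement.

    `attributes` maps an attribute name (e.g. "shape") to the expected
    finding (e.g. "round"), as a radiologist-designed QA template might.
    Here the check is approximated with case-insensitive substring matching;
    this stands in for the paper's LVLM-based attribute verification.
    """
    caption_lower = caption.lower()
    results = {name: expected.lower() in caption_lower
               for name, expected in attributes.items()}
    score = sum(results.values()) / len(results) if results else 0.0
    return score, results

# Hypothetical example: a chest X-ray nodule region with three attributes.
caption = "A well-circumscribed round opacity in the right upper lobe."
attributes = {
    "shape": "round",
    "margin": "well-circumscribed",
    "location": "right upper lobe",
}
score, detail = verify_attributes(caption, attributes)  # score == 1.0 here
```

This kind of scoring sidesteps the need for ground-truth region-caption pairs: only attribute-level labels (or verifier judgments) are required, which matches the benchmark design described above.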

๐Ÿ“ Abstract
Localized image captioning has made significant progress with models like the Describe Anything Model (DAM), which can generate detailed region-specific descriptions without explicit region-text supervision. However, such capabilities have yet to be widely applied to specialized domains like medical imaging, where diagnostic interpretation relies on subtle regional findings rather than global understanding. To mitigate this gap, we propose MedDAM, the first comprehensive framework leveraging large vision-language models for region-specific captioning in medical images. MedDAM employs medical expert-designed prompts tailored to specific imaging modalities and establishes a robust evaluation benchmark comprising a customized assessment protocol, data pre-processing pipeline, and specialized QA template library. This benchmark evaluates both MedDAM and other adaptable large vision-language models, focusing on clinical factuality through attribute-level verification tasks, thereby circumventing the absence of ground-truth region-caption pairs in medical datasets. Extensive experiments on the VinDr-CXR, LIDC-IDRI, and SkinCon datasets demonstrate MedDAM's superiority over leading peers (including GPT-4o, Claude 3.7 Sonnet, LLaMA-3.2 Vision, Qwen2.5-VL, GPT-4RoI, and OMG-LLaVA) in the task, revealing the importance of region-level semantic alignment in medical image understanding and establishing MedDAM as a promising foundation for clinical vision-language integration.
Problem

Research questions and friction points this paper is trying to address.

Enabling region-specific captioning in medical images
Addressing lack of ground-truth region-caption pairs
Improving clinical factuality in medical image interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large vision-language models for medical images
Uses expert-designed prompts for specific imaging modalities
Establishes robust evaluation benchmark for clinical factuality