Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis

📅 2026-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current evaluation protocols struggle to determine whether medical vision-language models (VLMs) genuinely learn lesion-specific discriminative features. This work presents the first systematic comparison, based on feature distribution analysis, of the representation capabilities of domain-specific versus general-purpose VLMs on multimodal medical images, employing both visualization and quantitative assessment across multiple lesion classification datasets. The study reveals that contextual enhancement in the text encoder is more critical for improving discriminability than large-scale medical image pretraining; that enhanced non-medical VLMs (e.g., LLM2CLIP) can outperform specialized models; and that general-purpose VLMs are susceptible to biases introduced by text overlaid on medical images. These findings offer a novel perspective for the design and evaluation of medical VLMs.

📝 Abstract
This study investigates the feature representations produced by publicly available open-source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations such as classification accuracy do not fully reveal whether they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and to evaluate the impact of medical specialization. We analyze the feature distributions extracted by representative medical VLMs across lesion classification datasets spanning multiple imaging modalities, and compare them with those of non-medical VLMs to assess the effect of domain-specific medical training. Our experiments show that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, we find that non-medical VLMs enhanced with recent contextual-enrichment techniques, such as LLM2CLIP, produce more refined feature representations. Our results imply that enhancing the text encoder is more crucial than intensive training on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by text strings overlaid on images. These findings underscore the need for careful model selection according to the downstream task, as well as awareness of potential inference risks due to background biases such as textual information in images.
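To make the idea of "quantitative assessment of feature discriminability" concrete, here is a minimal sketch, not the paper's actual protocol: given embeddings extracted by some VLM for images of two lesion classes, a Fisher-style separability score (between-class scatter over within-class scatter) quantifies how well the feature distribution separates the classes. The synthetic Gaussian embeddings below stand in for real VLM features; all names are illustrative.

```python
import numpy as np

def class_separability(features, labels):
    """Fisher-style score: ratio of between-class to within-class
    scatter. Higher values mean more discriminative features."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        cls = features[labels == c]
        mean_c = cls.mean(axis=0)
        between += len(cls) * np.sum((mean_c - overall_mean) ** 2)
        within += np.sum((cls - mean_c) ** 2)
    return between / within

# Synthetic stand-ins for 8-dim image embeddings of two lesion classes.
rng = np.random.default_rng(0)
well_separated = np.vstack([rng.normal(0, 1, (50, 8)),
                            rng.normal(5, 1, (50, 8))])
overlapping = np.vstack([rng.normal(0, 1, (50, 8)),
                         rng.normal(0.5, 1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

# A model whose features cluster by class scores higher:
print(class_separability(well_separated, labels) >
      class_separability(overlapping, labels))  # True
```

In the paper's setting, the same kind of score would be computed on embeddings from each VLM (medical and non-medical) over the same lesion dataset, so that discriminability can be compared across models independently of any downstream classifier.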
Problem

Research questions and friction points this paper is trying to address.

medical vision-language models
feature representation
discriminative power
medical specialization
lesion-specific features
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical vision-language models
feature distribution analysis
discriminative representation
text encoder enhancement
contextual enrichment
Keita Takeda
Graduate School of Integrated Science and Technology, Nagasaki University
Tomoya Sakai
IBM Research - Tokyo