Can language-guided unsupervised adaptation improve medical image classification using unpaired images and texts?

📅 2024-09-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the limitation of supervised learning caused by scarce annotated medical images, this paper proposes a language-guided unsupervised vision-language model (VLM) adaptation method that requires only unpaired medical images and category-level textual descriptions generated by a large language model (LLM), with no image-text pairs. The core innovation is a cross-modal adapter jointly optimized via a contrastive entropy loss and prompt tuning, enabling, for the first time in medical imaging, fully unsupervised VLM adaptation without any paired data and thus overcoming the traditional reliance of VLMs on aligned image-text corpora. Leveraging the MedCLIP visual encoder, the approach achieves significant improvements over zero-shot baselines across chest X-ray, diabetic retinopathy, and skin lesion classification benchmarks. Results demonstrate that classification performance can be effectively enhanced using only unlabeled images and independently generated textual descriptions.

๐Ÿ“ Abstract
In medical image classification, supervised learning is challenging due to the scarcity of labeled medical images. To address this, we leverage the visual-textual alignment within Vision-Language Models (VLMs) to enable unsupervised learning of a medical image classifier. In this work, we propose Medical Unsupervised Adaptation (MedUnA) of VLMs, where LLM-generated descriptions for each class are encoded into text embeddings and matched with class labels via a cross-modal adapter. This adapter attaches to the visual encoder of MedCLIP and aligns the visual embeddings through unsupervised learning, driven by a contrastive entropy-based loss and prompt tuning. This improves performance in scenarios where textual information is more abundant than labeled images, as is common in the healthcare domain. Unlike traditional VLMs, MedUnA uses unpaired images and text for learning representations, extending the potential of VLMs beyond their traditional constraints. We evaluate performance on three chest X-ray datasets and two multi-class datasets (diabetic retinopathy and skin lesions), showing significant accuracy gains over the zero-shot baseline. Our code is available at https://github.com/rumaima/meduna.
Problem

Research questions and friction points this paper is trying to address.

Improving medical image classification with unpaired images and texts
Leveraging Vision-Language Models for unsupervised medical image adaptation
Enhancing accuracy in healthcare with limited labeled medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Vision-Language Models for unsupervised learning
Uses LLM-generated text embeddings for class matching
Aligns visual-textual embeddings via contrastive entropy loss
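The entropy-based objective named in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, temperature value, and plain-list embedding representation are assumptions, and the adapter that would produce the image embeddings is omitted. The idea is that each unlabeled image embedding is scored against the embeddings of LLM-generated class descriptions, and minimizing the average prediction entropy pushes the adapter toward confident class assignments without any labels or image-text pairs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (so dot products become cosine similarity)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def entropy_loss(image_embs, class_text_embs, temperature=0.07):
    """Average prediction entropy of unlabeled image embeddings against
    class-description embeddings; lower means more confident assignments.

    image_embs:      list of d-dim adapted visual embeddings (unlabeled images)
    class_text_embs: list of d-dim embeddings of LLM-generated class descriptions
    temperature:     softmax sharpness (0.07 is a common contrastive default,
                     assumed here, not taken from the paper)
    """
    texts = [l2_normalize(t) for t in class_text_embs]
    total = 0.0
    for img in image_embs:
        img = l2_normalize(img)
        # Cosine-similarity logits between the image and every class description.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in texts]
        # Numerically stable softmax over classes.
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        s = sum(exps)
        probs = [e / s for e in exps]
        # Shannon entropy of the class distribution for this image.
        total += -sum(p * math.log(p + 1e-12) for p in probs)
    return total / len(image_embs)
```

When image embeddings line up with exactly one class description the loss approaches zero; embeddings that sit between classes yield higher entropy, which is the signal the unsupervised adaptation would minimize.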