🤖 AI Summary
To address the limitation of supervised learning caused by scarce annotated medical images, this paper proposes a language-guided unsupervised vision-language model (VLM) adaptation method that requires only unpaired medical images and category-level textual descriptions generated by a large language model (LLM), with no image-text pairs. The core innovation is a cross-modal adapter jointly optimized via a contrastive entropy-based loss and prompt tuning, enabling fully unsupervised VLM adaptation in medical imaging without any paired data and thereby overcoming the traditional reliance of VLMs on aligned image-text corpora. Built on the MedCLIP visual encoder, the approach achieves significant improvements over zero-shot baselines across three benchmark tasks: chest X-ray, diabetic retinopathy, and skin-lesion classification. The results demonstrate that classification performance can be effectively enhanced using only unlabeled images and independently generated textual descriptions.
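The summary describes an adapter that maps visual embeddings into the space of LLM-generated class-description embeddings. A minimal sketch of that idea is below; the module name, dimensions, and residual bottleneck design are my own illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of a MedUnA-style cross-modal adapter.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAdapter(nn.Module):
    """Small bottleneck MLP that refines visual embeddings so unlabeled
    images can be classified by similarity to text "class anchors"
    (embeddings of LLM-generated class descriptions)."""
    def __init__(self, dim: int = 512, bottleneck: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen encoder's features intact.
        return F.normalize(x + self.net(x), dim=-1)

# One text anchor per class; random stand-ins for real description embeddings.
num_classes, dim = 3, 512
text_anchors = F.normalize(torch.randn(num_classes, dim), dim=-1)

adapter = CrossModalAdapter(dim)
# Stand-in for features from a frozen visual encoder (e.g. MedCLIP's).
visual_feats = F.normalize(torch.randn(8, dim), dim=-1)

# Temperature-scaled image-to-class similarities act as classification logits.
logits = adapter(visual_feats) @ text_anchors.t() / 0.07
print(logits.shape)  # torch.Size([8, 3])
```

Because the anchors come from text alone, this classifier head needs no labeled or paired images; only the small adapter is trained.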
📝 Abstract
In medical image classification, supervised learning is challenging due to the scarcity of labeled medical images. To address this, we leverage the visual-textual alignment within Vision-Language Models (VLMs) to enable unsupervised learning of a medical image classifier. In this work, we propose **Med**ical **Un**supervised **A**daptation (`MedUnA`) of VLMs, where LLM-generated descriptions for each class are encoded into text embeddings and matched with class labels via a cross-modal adapter. This adapter attaches to the visual encoder of `MedCLIP` and aligns the visual embeddings through unsupervised learning, driven by a contrastive entropy-based loss and prompt tuning. This improves performance in scenarios where textual information is more abundant than labeled images, as is common in the healthcare domain. Unlike traditional VLMs, `MedUnA` uses **unpaired images and text** for learning representations, extending the applicability of VLMs beyond their traditional constraints. We evaluate performance on three chest X-ray datasets and two multi-class datasets (diabetic retinopathy and skin lesions), showing significant accuracy gains over the zero-shot baseline. Our code is available at https://github.com/rumaima/meduna.
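The abstract names a contrastive entropy-based loss but does not define it here. One plausible component of such an objective, shown purely as an assumption-laden sketch, is entropy minimization over the image-to-class similarity distribution, which pushes each unlabeled image toward a confident assignment to one text-described class:

```python
# Illustrative entropy-minimization term; the paper's actual contrastive
# entropy-based loss may differ in form.
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of per-image class distributions.
    Lower entropy means more confident (peaked) predictions."""
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(probs * log_probs).sum(dim=-1).mean()

# Stand-in similarities between 8 unlabeled images and 3 classes.
logits = torch.randn(8, 3, requires_grad=True)
loss = entropy_loss(logits)
loss.backward()  # gradients would flow to adapter / prompt parameters
print(float(loss))
```

Since the loss needs only model predictions, it fits the unsupervised setting: no labels or image-text pairs are required, only unlabeled images and the text anchors.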