🤖 AI Summary
Dermatological zero-shot diagnosis is hindered by the limited capacity of existing vision-language pretraining (VLP) models to handle lengthy clinical text and by insufficient structured representation of domain-specific knowledge. To address this, the paper proposes MAKE, a Multi-Aspect Knowledge-Enhanced VLP framework that introduces (i) multi-aspect knowledge-decomposition contrastive learning, (ii) diagnosis-guided sub-caption weighting, and (iii) an LLM-driven paradigm for knowledge-augmented sub-text generation, the first to integrate large language models (LLMs) into the VLP pretraining pipeline for structured dermatological knowledge modeling. Combining LLM-based knowledge decomposition, multi-granularity image-text contrastive learning, and dynamic weighting by diagnostic priors, MAKE is pretrained on 403K dermatology image-text pairs. It achieves state-of-the-art performance on eight datasets spanning zero-shot disease classification, concept annotation, and cross-modal retrieval, with substantial improvements in fine-grained lesion understanding and generalization.
📝 Abstract
Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. While vision-language pretraining (VLP) has advanced medical AI, its effectiveness in dermatology is limited by text-length constraints and the lack of structured clinical texts. In this paper, we introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Recognizing that comprehensive dermatological descriptions require multiple knowledge aspects that exceed standard text constraints, our framework introduces: (1) a multi-aspect contrastive learning strategy that decomposes clinical narratives into knowledge-enhanced sub-texts through large language models, (2) a fine-grained alignment mechanism that connects sub-captions with diagnostically relevant image features, and (3) a diagnosis-guided weighting scheme that adaptively prioritizes different sub-captions based on clinical significance priors. Through pretraining on 403,563 dermatological image-text pairs collected from educational resources, MAKE significantly outperforms state-of-the-art VLP models on eight datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks. Our code will be made publicly available at https://github.com/SiyuanYan1/MAKE.
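As a rough illustration of how components (1) and (3) could fit together, the sketch below is an assumption, not the paper's implementation: it pairs a batch of image embeddings with K knowledge-aspect sub-text embeddings, computes a symmetric InfoNCE contrastive loss per aspect, and mixes the per-aspect losses with softmax weights derived from a hypothetical clinical-significance prior vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def info_nce(logits):
    # Cross-entropy where matched image-text pairs lie on the diagonal.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

def multi_aspect_loss(img_emb, subtext_embs, aspect_priors, tau=0.07):
    """
    img_emb:       (B, D) image embeddings
    subtext_embs:  (K, B, D) one embedding per knowledge aspect per image
    aspect_priors: (K,) hypothetical clinical-significance prior per aspect
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    weights = softmax(aspect_priors)  # diagnosis-guided aspect weighting
    total = 0.0
    for k, txt_k in enumerate(subtext_embs):
        txt = txt_k / np.linalg.norm(txt_k, axis=1, keepdims=True)
        logits = img @ txt.T / tau  # (B, B) cosine similarities / temperature
        # Symmetric InfoNCE: image-to-text plus text-to-image.
        loss_k = 0.5 * (info_nce(logits) + info_nce(logits.T))
        total += weights[k] * loss_k
    return total
```

In this sketch the prior vector is a free parameter; in a full system it would presumably be derived from diagnostic relevance rather than fixed by hand.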