MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

📅 2025-05-09
🤖 AI Summary
Current dermatological vision-language models (VLMs) suffer from insufficient multimodal specialization and weak diagnostic reasoning. To address this, we introduce MM-Skin, the first large-scale multimodal dermatology dataset spanning clinical, dermoscopic, and histopathological imaging, comprising nearly 10,000 high-quality textbook image-text pairs and over 27,000 instruction-following visual question answering (VQA) samples. We propose an instruction-enhanced VQA construction paradigm that integrates structured textbook text extraction, cross-modal alignment, and LLM-driven data augmentation. Furthermore, we design domain-adapted supervised fine-tuning (SFT) and alignment training strategies to develop SkinVL, a specialized dermatological VLM. Experiments show that SkinVL consistently outperforms both general-purpose and medical VLMs across eight dermatology benchmarks, achieving an average accuracy gain of 12.6% on VQA, SFT, and zero-shot classification tasks, significantly advancing fine-grained dermatological diagnosis and zero-shot generalization.

📝 Abstract
Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, a specialized dermatology VLM capable of delivering professional, detailed diagnostic analysis remains underdeveloped, primarily because the text descriptions in current dermatology multimodal datasets lack specialization. To address this issue, we propose MM-Skin, the first large-scale multimodal dermatology dataset encompassing three imaging modalities (clinical, dermoscopic, and pathological) and nearly 10k high-quality image-text pairs collected from professional textbooks. In addition, we generate over 27k diverse, instruction-following visual question answering (VQA) samples, nine times the size of the current largest dermatology VQA dataset. Leveraging public datasets and MM-Skin, we developed SkinVL, a dermatology-specific VLM designed for precise and nuanced skin disease interpretation. Comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT), and zero-shot classification tasks across 8 datasets reveal its exceptional performance on skin diseases in comparison to both general and medical VLMs. MM-Skin and SkinVL offer a meaningful contribution to advancing the development of clinical dermatology VLM assistants. MM-Skin is available at https://github.com/ZwQ803/MM-Skin
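The abstract describes deriving instruction-following VQA samples from textbook image-text pairs. A minimal sketch of what one such record might look like, assuming a simple single-turn conversation schema; all field names and example strings here are hypothetical, not the dataset's actual format (see the MM-Skin repository for the real schema):

```python
# Hypothetical sketch: wrapping one textbook image-text pair as an
# instruction-following VQA sample. Field names are illustrative only.

def make_vqa_sample(image_path, caption, question, answer, modality):
    """Build a single-turn VQA instruction record from an image-text pair."""
    assert modality in {"clinical", "dermoscopic", "pathological"}
    return {
        "image": image_path,
        "modality": modality,
        "source_caption": caption,  # original textbook description
        "conversations": [
            {"role": "user", "content": f"<image>\n{question}"},
            {"role": "assistant", "content": answer},
        ],
    }

sample = make_vqa_sample(
    image_path="images/derm_0001.jpg",
    caption="Dermoscopy showing a pigmented network with streaks.",
    question="What dermoscopic structures are visible in this lesion?",
    answer="A pigmented network with peripheral streaks is visible.",
    modality="dermoscopic",
)
print(sample["conversations"][0]["role"])  # → user
```

An LLM-driven augmentation step, as the summary describes, would then rephrase the question and expand the answer while keeping the image and source caption fixed.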
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized dermatology vision-language models for detailed diagnostics
Insufficient high-quality dermatology image-text datasets from professional sources
Need for improved performance in skin disease interpretation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal dermatology dataset MM-Skin
Diverse instruction-following VQA samples generation
Dermatology-specific VLM SkinVL for precise interpretation
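The zero-shot classification task evaluated in the paper is typically implemented CLIP-style: the image embedding is compared against text embeddings of label prompts, and the most similar label wins. A minimal NumPy sketch of that matching step, with fabricated toy embeddings standing in for a real encoder's output:

```python
import numpy as np

# Toy sketch of CLIP-style zero-shot classification: pick the class whose
# text-prompt embedding has the highest cosine similarity with the image
# embedding. A real system would obtain these vectors from a trained VLM;
# the embeddings below are fabricated for illustration.

def zero_shot_classify(image_emb, label_embs, labels):
    """Return the label whose prompt embedding best matches the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarities, one per label
    return labels[int(np.argmax(sims))]

labels = ["melanoma", "basal cell carcinoma", "benign nevus"]
label_embs = np.array([[1.0, 0.1, 0.0],
                       [0.0, 1.0, 0.1],
                       [0.1, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])  # closest to the first prompt
print(zero_shot_classify(image_emb, label_embs, labels))  # → melanoma
```

Normalizing both sides makes the dot product a cosine similarity, so the comparison is invariant to embedding magnitude.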
Wenqi Zeng
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Yuqi Sun
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Chenxi Ma
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Weimin Tan
Fudan University
Research interests: computer vision, deep learning, saliency detection, small object detection and recognition
Bo Yan
Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China