SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in skin cancer diagnosis with vision-language models: high computational cost, data scarcity, and poor interpretability. To this end, the authors propose an efficient and clinically trustworthy multimodal diagnostic framework that freezes the CLIP visual encoder and integrates a lightweight, quantized Qwen2.5-VL language model. The approach further incorporates low-rank adaptation (LoRA) and a novel consistency-aware focal alignment (CFA) loss to precisely align lesion regions with clinical semantics under long-tailed data distributions. Evaluated on the ISIC and Derm7pt benchmarks, the method achieves 4.3–6.2% higher accuracy than a 13B-parameter baseline while using 43% fewer parameters. Blinded expert reviews and out-of-distribution testing confirm its superior interpretability and clinical credibility compared to conventional saliency-map techniques.
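The summary's core efficiency claim rests on freezing the pretrained weights and training only low-rank adapters. The paper's actual adapter placement and ranks are not given here; the following is a minimal, hypothetical NumPy sketch of the LoRA idea itself (a frozen weight `W` plus a trainable update `(alpha/r) * B @ A`, with `B` zero-initialized so training starts from the pretrained behavior):

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Hypothetical sketch of a LoRA-adapted dense layer.

    The pretrained weight W stays frozen; only the low-rank factors
    A (down-projection) and B (up-projection) would be trained.
    """
    def __init__(self, d_in, d_out, r=4, alpha=8):
        self.W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, r))                    # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # y = x W^T + (alpha/r) * x A^T B^T
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=16, d_out=16)
x = rng.standard_normal((2, 16))
y = layer(x)

# With B zero-initialized, the adapted layer exactly matches the frozen base layer.
print(np.allclose(y, x @ layer.W.T))  # True at initialization
```

Because only `A` and `B` receive gradients (here 128 values versus 256 in `W`, and the gap widens rapidly with layer width), this is how such frameworks cut trainable-parameter counts while leaving the foundation model intact.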

📝 Abstract
The deployment of vision-language models (VLMs) in dermatology is hindered by a trilemma of high computational cost, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a "frozen perception, adaptive reasoning" paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) loss, an objective that combines focal re-weighting, semantic alignment, and calibration. On the ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3–6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
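The abstract states that the CFA loss uses focal re-weighting to handle long-tailed lesion distributions, but its exact formula is not given here. As an illustration of the focal re-weighting component alone, here is a sketch of the standard focal-loss formulation, where `p_true` is the probability assigned to the correct class and `gamma` down-weights easy, well-classified examples so that rare (tail) classes dominate the gradient:

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Standard focal re-weighting of cross-entropy:
    FL = -alpha * (1 - p_t)^gamma * log(p_t).
    For gamma = 0 this reduces to plain cross-entropy;
    larger gamma suppresses the loss on easy examples.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# An easy example (p_t = 0.95) is suppressed far more than a hard one (p_t = 0.2):
easy, hard = focal_loss(0.95), focal_loss(0.2)
ce_easy, ce_hard = -math.log(0.95), -math.log(0.2)
print(easy < ce_easy and hard < ce_hard)    # focal never exceeds plain CE for gamma > 0
print(hard / easy > ce_hard / ce_easy)      # relative emphasis shifts toward hard examples
```

How the paper couples this re-weighting with its semantic-alignment and calibration terms is not specified in this summary; the sketch only shows why a focal term helps under class imbalance.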
Problem

Research questions and friction points this paper is trying to address.

vision-language models
skin cancer diagnosis
data scarcity
computational cost
black-box nature
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency-aware Focal Alignment
Low-rank Adaptation (LoRA)
Vision-Language Model
Skin Cancer Diagnosis
Quantized Multimodal Learning
Zhixiang Lu
Xi’an Jiaotong-Liverpool University, China
Shijie Xu
Xi’an Jiaotong-Liverpool University, China
Kaicheng Yan
Xi’an Jiaotong-Liverpool University, China
Xuyue Cai
Xi’an Jiaotong-Liverpool University, China
Chong Zhang
University of Liverpool
LLM Safety · Explainability for LLM · Multi-Agent Systems
Yulong Li
Xi’an Jiaotong-Liverpool University, China
Angelos Stefanidis
Xi’an Jiaotong-Liverpool University, China
Anh Nguyen
University of Liverpool
Robotic Vision · Machine Learning · Robotics
Jionglong Su
Xi’an Jiaotong-Liverpool University, China
AI · Big Data · Machine Learning · Statistics