Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

📅 2026-03-09
🤖 AI Summary
This work addresses composed vision-language retrieval for skin cancer case search, where each query pairs a reference lesion image with clinical text, by proposing a transformer-based framework that jointly aligns global and local representations. The method combines global semantic supervision with local alignment over multiple spatial attention masks, and computes the final similarity through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset show consistent improvements over state-of-the-art methods, supporting case retrieval for diagnostic decision-making, medical education, and quality control.

📝 Abstract
Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision-making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
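
The convex global-local weighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how local scores are pooled or how the weight is chosen, so the cosine scoring, the mean over masks, the function names (`cosine`, `masked_pool`, `composed_similarity`), and the weight `alpha = 0.6` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def masked_pool(features: Sequence[Vector], mask: Sequence[float]) -> List[float]:
    """Attention-weighted average of per-location feature vectors."""
    total = sum(mask)
    if total <= 0:
        raise ValueError("attention mask must have positive total weight")
    dim = len(features[0])
    return [sum(m * f[d] for m, f in zip(mask, features)) / total for d in range(dim)]


def composed_similarity(
    q_global: Vector,
    c_global: Vector,
    q_locals: Sequence[Vector],
    c_features: Sequence[Vector],
    masks: Sequence[Sequence[float]],
    alpha: float = 0.6,  # hypothetical weight; the source gives no value
) -> float:
    """Convex combination of local and global similarity (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    s_global = cosine(q_global, c_global)
    # One local score per attention mask: pool the candidate's region
    # features under the mask, then compare with the query's local vector.
    local_scores = [
        cosine(q_local, masked_pool(c_features, mask))
        for q_local, mask in zip(q_locals, masks)
    ]
    s_local = sum(local_scores) / len(local_scores)  # mean pooling is an assumption
    return alpha * s_local + (1.0 - alpha) * s_global


if __name__ == "__main__":
    # Toy candidate with two spatial locations and two attention masks.
    score = composed_similarity(
        q_global=[1.0, 0.0],
        c_global=[1.0, 0.0],
        q_locals=[[0.0, 1.0], [0.0, 1.0]],
        c_features=[[1.0, 0.0], [0.0, 1.0]],
        masks=[[1.0, 0.0], [0.0, 1.0]],
    )
    print(score)
```

Because the two weights are non-negative and sum to one, the combined score always lies between the local and global scores; setting `alpha` above 0.5 favors local evidence, in line with the emphasis the abstract describes.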
Problem

Research questions and friction points this paper is trying to address.

composed vision-language retrieval
skin cancer
medical image retrieval
dermoscopic features
biopsy-confirmed cases
Innovation

Methods, ideas, or system contributions that make the work stand out.

composed vision-language retrieval
global-local alignment
spatial attention masks
hierarchical query representation
domain-informed weighting
Yuheng Wang
The University of British Columbia, Vancouver, BC, Canada V6T 1Z4
Yuji Lin
Shenzhen University, Shenzhen, Guangdong, China, 518055
Dongrun Zhu
Shenzhen University, Shenzhen, Guangdong, China, 518055
Jiayue Cai
Shenzhen University, Shenzhen, Guangdong, China, 518055
Sunil Kalia
The University of British Columbia, Vancouver, BC, Canada V6T 1Z4
Harvey Lui
The University of British Columbia, Vancouver, BC, Canada V6T 1Z4
Chunqi Chang
Shenzhen University, Shenzhen, Guangdong, China, 518055
Z. Jane Wang
Professor of Electrical and Computer Engineering Dept., University of British Columbia, Canada
Signal/Image/Video processing, Machine Learning, Digital media data analytics, digital media security & forensics, biomedical si
Tim K. Lee
BC Cancer Research Centre
artificial intelligence, computer-aided diagnosis, polarization speckles, skin cancer detection, risk and prognosis factors