Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited interpretability of AI-based skin disease diagnosis, this paper integrates a multimodal large language model (MLLM) with quantitative dermatological attributes. The authors fine-tune the MLLM's image embedding space to explicitly encode quantifiable clinical features, such as lesion area, and evaluate the approach on the SLICE-3D dataset for both attribute prediction and attribute-conditioned semantic retrieval. Crucially, they introduce the first learnable alignment mechanism between the MLLM embedding space and quantitative skin attributes, enabling attribute-driven reasoning traceability and content retrieval. Experimental results show high attribute-prediction accuracy (MAE < 0.8 cm²) and a substantial improvement in attribute–semantic alignment for retrieval (R@1 increased by 23.6%). By grounding model behavior in clinically meaningful, measurable attributes, the framework improves diagnostic transparency and clinical trustworthiness.
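To make the grounding step concrete, below is a minimal PyTorch sketch of the idea, assuming a small placeholder CNN stands in for the MLLM's image tower; every module, dimension, and variable name here is hypothetical rather than the paper's actual architecture.

```python
# Minimal sketch: ground an image embedding in quantitative attributes
# by attaching a regression head and fine-tuning with an L1 objective.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class AttributeGroundedEncoder(nn.Module):
    """Wraps a (placeholder) image encoder with a regression head that
    predicts quantitative lesion attributes, e.g. area in cm^2."""
    def __init__(self, embed_dim: int = 256, num_attributes: int = 1):
        super().__init__()
        # Stand-in for the MLLM vision tower; a real run would load the
        # pretrained encoder and fine-tune (or adapter-tune) it instead.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Lightweight head that ties the embedding to measurable attributes.
        self.attr_head = nn.Linear(embed_dim, num_attributes)

    def forward(self, images: torch.Tensor):
        emb = self.encoder(images)   # (B, embed_dim) grounded embedding
        attrs = self.attr_head(emb)  # (B, num_attributes) predicted values
        return emb, attrs

model = AttributeGroundedEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()  # L1 pairs naturally with the MAE metric quoted above

# Dummy batch standing in for SLICE-3D crops and their lesion-area labels.
images = torch.randn(8, 3, 224, 224)
areas = torch.rand(8, 1) * 4.0  # hypothetical areas in cm^2

optimizer.zero_grad()
_, predicted = model(images)
loss = loss_fn(predicted, areas)
loss.backward()
optimizer.step()
```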

📝 Abstract
Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for a diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes related to lesion appearance (e.g., lesion area) have recently been found to be highly predictive of malignancy. Predictions grounded in such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes by fine-tuning them to predict attribute values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study on the SLICE-3D dataset.
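As a rough illustration of the retrieval case study, the sketch below indexes grounded gallery embeddings and ranks them by cosine similarity to query embeddings. All tensor shapes and names are hypothetical; in the paper's setting, both query and gallery embeddings would come from the fine-tuned MLLM image encoder rather than random tensors.

```python
# Minimal sketch of content-based image retrieval over grounded embeddings.
# Random tensors stand in for encoder outputs; shapes are assumptions.
import torch
import torch.nn.functional as F

# Hypothetical grounded embeddings: 1000 gallery images, 5 queries, dim 256.
gallery = F.normalize(torch.randn(1000, 256), dim=1)
queries = F.normalize(torch.randn(5, 256), dim=1)

# Cosine similarity reduces to a dot product after L2 normalization.
similarity = queries @ gallery.T      # (5, 1000) similarity matrix
topk = similarity.topk(k=5, dim=1)    # indices of the 5 nearest neighbors
print(topk.indices)
```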
Problem

Research questions and friction points this paper is trying to address.

Grounding MLLMs with quantitative skin attributes for interpretability
Improving model interpretability through attribute-based predictions
Evaluating attribute-specific image retrieval using SLICE-3D dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning MLLMs to predict quantitative skin attributes
Grounding embedding spaces with lesion appearance concepts
Using attribute-specific content-based image retrieval (see the Recall@1 sketch after this list)
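A minimal sketch of how the attribute–semantic alignment could be scored with Recall@1, assuming lesion attributes are discretized into coarse bins; the binning scheme and all names below are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Recall@1 over attribute bins: a query counts as a hit if its top-1
# retrieved neighbor falls in the same attribute bin. Data is synthetic.
import torch

def recall_at_1(similarity: torch.Tensor,
                query_bins: torch.Tensor,
                gallery_bins: torch.Tensor) -> float:
    """Fraction of queries whose top-1 neighbor shares the attribute bin."""
    top1 = similarity.argmax(dim=1)            # (num_queries,)
    hits = gallery_bins[top1] == query_bins    # per-query hit mask
    return hits.float().mean().item()

# Hypothetical setup: lesion area binned into 4 size categories.
similarity = torch.randn(5, 1000)
query_bins = torch.randint(0, 4, (5,))
gallery_bins = torch.randint(0, 4, (1000,))
print(f"R@1 = {recall_at_1(similarity, query_bins, gallery_bins):.3f}")
```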