Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited interpretability of AI-based skin disease diagnosis, this paper integrates a multimodal large language model (MLLM) with quantitative dermatological attributes. The authors fine-tune the MLLM's image embedding space to explicitly encode quantifiable clinical features, such as lesion area, and evaluate the approach on the SLICE-3D dataset for both attribute prediction and attribute-conditioned semantic retrieval. Crucially, they introduce the first learnable alignment mechanism between the MLLM embedding space and quantitative skin attributes, enabling attribute-driven reasoning traceability and content retrieval. Experimental results show high attribute-prediction accuracy (MAE < 0.8 cm²) and a substantial improvement in attribute–semantic alignment for retrieval (R@1 increased by 23.6%). By grounding model behavior in clinically meaningful, measurable attributes, the framework improves diagnostic transparency and clinical trustworthiness.
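To make the grounding step concrete, below is a minimal PyTorch sketch of the idea, assuming a small placeholder CNN stands in for the MLLM's image tower; every module, dimension, and variable name here is hypothetical rather than the paper's actual architecture.

```python
# Minimal sketch: ground an image embedding in quantitative attributes
# by attaching a regression head and fine-tuning with an L1 objective.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class AttributeGroundedEncoder(nn.Module):
    """Wraps a (placeholder) image encoder with a regression head that
    predicts quantitative lesion attributes, e.g. area in cm^2."""
    def __init__(self, embed_dim: int = 256, num_attributes: int = 1):
        super().__init__()
        # Stand-in for the MLLM vision tower; a real run would load the
        # pretrained encoder and fine-tune (or adapter-tune) it instead.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Lightweight head that ties the embedding to measurable attributes.
        self.attr_head = nn.Linear(embed_dim, num_attributes)

    def forward(self, images: torch.Tensor):
        emb = self.encoder(images)   # (B, embed_dim) grounded embedding
        attrs = self.attr_head(emb)  # (B, num_attributes) predicted values
        return emb, attrs

model = AttributeGroundedEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()  # L1 pairs naturally with the MAE metric quoted above

# Dummy batch standing in for SLICE-3D crops and their lesion-area labels.
images = torch.randn(8, 3, 224, 224)
areas = torch.rand(8, 1) * 4.0  # hypothetical areas in cm^2

optimizer.zero_grad()
_, predicted = model(images)
loss = loss_fn(predicted, areas)
loss.backward()
optimizer.step()
```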

📝 Abstract
Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for a diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes related to lesion appearance (e.g., lesion area) have recently been found to be highly predictive of malignancy. Predictions grounded in such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes by fine-tuning them to predict attribute values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study on the SLICE-3D dataset.
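As a rough illustration of the retrieval case study, the sketch below indexes grounded gallery embeddings and ranks them by cosine similarity to query embeddings. All tensor shapes and names are hypothetical; in the paper's setting, both query and gallery embeddings would come from the fine-tuned MLLM image encoder rather than random tensors.

```python
# Minimal sketch of content-based image retrieval over grounded embeddings.
# Random tensors stand in for encoder outputs; shapes are assumptions.
import torch
import torch.nn.functional as F

# Hypothetical grounded embeddings: 1000 gallery images, 5 queries, dim 256.
gallery = F.normalize(torch.randn(1000, 256), dim=1)
queries = F.normalize(torch.randn(5, 256), dim=1)

# Cosine similarity reduces to a dot product after L2 normalization.
similarity = queries @ gallery.T      # (5, 1000) similarity matrix
topk = similarity.topk(k=5, dim=1)    # indices of the 5 nearest neighbors
print(topk.indices)
```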
Problem

Research questions and friction points this paper is trying to address.

Grounding MLLMs with quantitative skin attributes for interpretability
Improving model interpretability through attribute-based predictions
Evaluating attribute-specific image retrieval using SLICE-3D dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning MLLMs to predict quantitative skin attributes
Grounding embedding spaces with lesion appearance concepts
Using attribute-specific content-based image retrieval (see the Recall@1 sketch after this list)
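A minimal sketch of how the attribute–semantic alignment could be scored with Recall@1, assuming lesion attributes are discretized into coarse bins; the binning scheme and all names below are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Recall@1 over attribute bins: a query counts as a hit if its top-1
# retrieved neighbor falls in the same attribute bin. Data is synthetic.
import torch

def recall_at_1(similarity: torch.Tensor,
                query_bins: torch.Tensor,
                gallery_bins: torch.Tensor) -> float:
    """Fraction of queries whose top-1 neighbor shares the attribute bin."""
    top1 = similarity.argmax(dim=1)            # (num_queries,)
    hits = gallery_bins[top1] == query_bins    # per-query hit mask
    return hits.float().mean().item()

# Hypothetical setup: lesion area binned into 4 size categories.
similarity = torch.randn(5, 1000)
query_bins = torch.randint(0, 4, (5,))
gallery_bins = torch.randint(0, 4, (1000,))
print(f"R@1 = {recall_at_1(similarity, query_bins, gallery_bins):.3f}")
```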