🤖 AI Summary
This work addresses the limitations of monolithic large language models in fine-grained, large-scale multi-class, and rare skin disease diagnosis. These models struggle because training data are sparse, and they lack the interpretability and traceability that clinical use requires. To overcome these issues, we propose a multimodal collaborative multi-agent system that emulates the clinical workflow of dermatologists and introduces, for the first time, a self-evolving dermatological memory mechanism that is not limited by a static knowledge base. By integrating multi-agent collaboration, multimodal alignment, and fine-grained classification, our approach achieves a 9.6% accuracy gain on DDI31 and a 13% improvement in weighted F1 score on Dermnet. It also significantly outperforms existing methods on a fine-grained dataset encompassing 498 disease classes and on a novel rare-disease benchmark comprising 564 samples across eight rare conditions, enabling transparent, trustworthy, and clinically viable dermatological diagnosis.
📝 Abstract
Although recent advances in Large Language Models (LLMs) have significantly improved dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and with rare skin disease diagnosis owing to sparse training data, and they lack the interpretability and traceability essential for clinical reasoning. Multi-agent systems can offer more transparent and explainable diagnostics, but existing frameworks focus primarily on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs on four public datasets, where it achieves the best performance, with a +9.6% accuracy improvement on DDI31 and a +13% weighted F1 gain on Dermnet over the previous state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate a rare skin disease dataset, the first benchmark to address the scarcity of clinical data on rare skin diseases, which contains 564 clinical samples spanning eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.
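For readers unfamiliar with the three metrics reported above (accuracy, weighted F1, and Cohen's Kappa), the following is a minimal sketch of their standard definitions for multi-class labels. It is not the authors' evaluation code: the function names and the toy labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is >= n >= 1 here.
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += n * f1
    return total / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between predictions and references, corrected for chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement expected from the two marginal label distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here; returning 0.0 is a choice of this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the paper.
    y_true = ["a", "a", "b", "b", "c", "c"]
    y_pred = ["a", "b", "b", "b", "c", "a"]
    print(accuracy(y_true, y_pred))
    print(weighted_f1(y_true, y_pred))
    print(cohen_kappa(y_true, y_pred))
```

Weighted F1 and Cohen's Kappa complement plain accuracy on datasets with many or imbalanced classes. Weighting by support keeps frequent classes from being under-counted, while Kappa discounts the agreement that would be expected by chance.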