🤖 AI Summary
Current multimodal large language models face limitations in dermatological image analysis due to insufficient domain knowledge and susceptibility to hallucinations, hindering reliable and traceable diagnoses. This work proposes the first multi-tool collaborative agent system equipped with a planning–execution–reflection mechanism, integrating visual perception, bimodal evidence retrieval—leveraging a corpus of 413,210 images and 3,199 clinical guidelines—and a certainty-aware critique module that evaluates confidence, coverage, and conflict detection. The resulting framework enables fine-grained, interpretable, and hallucination-resistant decision-making. Evaluated across five dermatology benchmarks, the method substantially outperforms existing models, achieving a 17.6% improvement in diagnostic accuracy over GPT-4o and a 3.15% gain in ROUGE-L score.
📝 Abstract
Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.