🤖 AI Summary
To address the limitations of large language models (LLMs) in complex clinical diagnosis, namely isolated per-case reasoning and non-reusable experiential knowledge, this paper proposes a self-learning Multi-Agent Clinical Diagnosis (MACD) framework. A multi-agent pipeline summarizes, refines, and applies diagnostic insights so that clinical knowledge accumulates across cases. In the MACD-human collaborative workflow, multiple diagnostician agents consult iteratively, supported by an evaluator agent and by human oversight when agreement is not reached. The self-learned knowledge shows cross-model stability, transferability, and model-specific personalization, and the system generates traceable rationales. Evaluated with open-source models (Llama-3.1 8B/70B and DeepSeek-R1-Distill-Llama 70B) on 4,390 real-world cases spanning seven diseases, MACD improves primary diagnostic accuracy by up to 22.3% over established clinical guidelines. On a subset of the data it matches or exceeds human physicians (up to 16% improvement over physician-only diagnosis), and the MACD-human workflow achieves an 18.6% improvement over physician-only diagnosis.
📝 Abstract
Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real-world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi-agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi-Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self-learn clinical knowledge via a multi-agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more accurate diagnosis that focuses on key disease-specific cues. We further extend it to a MACD-human collaborative workflow, where multiple LLM-based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real-world patient cases across seven diseases using diverse open-source LLMs (Llama-3.1 8B/70B, DeepSeek-R1-Distill-Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains of up to 22.3%. On a subset of the data, it achieves performance on par with or exceeding that of human physicians (up to 16% improvement over physician-only diagnosis). Additionally, the MACD-human workflow achieves an 18.6% improvement compared to physician-only diagnosis. Moreover, the self-learned knowledge exhibits strong cross-model stability, transferability, and model-specific personalization, while the system can generate traceable rationales, enhancing explainability. Consequently, this work presents a scalable self-learning paradigm for LLM-assisted diagnosis, bridging the gap between the intrinsic knowledge of LLMs and real-world clinical practice.
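The collaborative workflow described above (diagnostician agents consult over several rounds, an evaluator checks for agreement, and unresolved cases go to a human) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`consult`, `unanimous`, `stubborn`, `follower`), the unanimity rule, the three-round limit, and the rule-based stub agents standing in for LLM calls are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical interface: a diagnostician maps (case text, peer opinions
# from the previous round) to a diagnosis label. A real system would
# call an LLM here; these stubs are rule-based placeholders.
Diagnostician = Callable[[str, list[str]], str]
Evaluator = Callable[[list[str]], Optional[str]]


def unanimous(opinions: list[str]) -> Optional[str]:
    """Assumed evaluator rule: accept only when every agent agrees."""
    return opinions[0] if len(set(opinions)) == 1 else None


def consult(case: str, diagnosticians: list[Diagnostician],
            evaluator: Evaluator = unanimous, max_rounds: int = 3) -> dict:
    """Run iterative consultation rounds; escalate if no agreement."""
    opinions: list[str] = []
    for round_no in range(1, max_rounds + 1):
        # Each agent sees the opinions from the previous round.
        opinions = [agent(case, opinions) for agent in diagnosticians]
        verdict = evaluator(opinions)
        if verdict is not None:
            return {"diagnosis": verdict, "rounds": round_no,
                    "escalated": False}
    # No agreement within the round limit: defer to human oversight.
    return {"diagnosis": None, "rounds": max_rounds,
            "escalated": True, "opinions": opinions}


def stubborn(label: str) -> Diagnostician:
    """Stub agent that always returns the same diagnosis."""
    return lambda case, peers: label


def follower(initial: str) -> Diagnostician:
    """Stub agent that adopts the most common peer opinion, if any."""
    def agent(case: str, peers: list[str]) -> str:
        return Counter(peers).most_common(1)[0][0] if peers else initial
    return agent


agreed = consult("cough, fever", [stubborn("pneumonia"),
                                  stubborn("pneumonia"),
                                  follower("bronchitis")])
unresolved = consult("cough, fever", [stubborn("pneumonia"),
                                      stubborn("bronchitis")])
```

In this sketch the first panel converges in the second round, once the follower agent sees its peers' opinions, while the second panel never agrees and is flagged for escalation to a human reviewer.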