🤖 AI Summary
This work proposes a mixed-vendor Multi-Agent Conversation (MAC) framework for clinical diagnosis. It targets a weakness of single-vendor large language model (LLM) ensembles: agents from the same model family share biases, so they tend to reinforce systematic errors rather than correct them. The study shows that vendor diversity consistently improves diagnostic performance and identifies complementary inductive biases as the underlying mechanism. The proposed system, which uses o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet as doctor agents, achieves state-of-the-art recall and accuracy on the RareBench and DiagnosisArena benchmarks, outperforming both single-LLM and single-vendor configurations.
📝 Abstract
Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
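The abstract does not say how the overlap analysis is computed, so the following is only a minimal sketch of one plausible approach: treat each model's correctly diagnosed cases as a set, then compare those sets with each other and with the team's result. The function name `overlap_report`, the model labels, and the case IDs are all hypothetical and are not taken from the paper or its benchmarks.

```python
from itertools import combinations


def overlap_report(correct_by_model, team_correct):
    """Compare per-model sets of correctly diagnosed case IDs with a team's set.

    correct_by_model: dict mapping a model label to the set of case IDs it solved alone.
    team_correct: set of case IDs solved by the multi-agent team.
    """
    solved_sets = list(correct_by_model.values())
    union_individual = set().union(*solved_sets)     # solved by at least one model
    shared_by_all = set.intersection(*solved_sets)   # solved by every model

    # Cases that exactly one model solves: a rough proxy for complementary strengths.
    unique = {}
    for name, solved in correct_by_model.items():
        others = set().union(*(s for n, s in correct_by_model.items() if n != name))
        unique[name] = solved - others

    # Pairwise Jaccard similarity: lower values mean less correlated successes.
    jaccard = {}
    for (a, sa), (b, sb) in combinations(correct_by_model.items(), 2):
        both = sa | sb
        jaccard[(a, b)] = len(sa & sb) / len(both) if both else 0.0

    return {
        "union_individual": union_individual,
        "shared_by_all": shared_by_all,
        "unique": unique,
        "jaccard": jaccard,
        # Cases the team solves that no single model solved on its own.
        "team_only": team_correct - union_individual,
    }


# Made-up case IDs, purely for illustration.
toy_correct = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5},
    "model_c": {4, 6},
}
toy_team = {1, 2, 3, 4, 5, 6, 7}
report = overlap_report(toy_correct, toy_team)
```

In this toy input, only case 4 is solved by every model, while each model contributes at least one case the others miss. A homogeneous team would be expected to show higher pairwise Jaccard values and fewer unique cases, which is the intuition behind pooling complementary inductive biases.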