Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) exhibit weak instruction-following capabilities, lack self-reflective reasoning, and suffer from error-prone interpretations when heterogeneous VLMs are naively combined—particularly in clinical decision-making. To address these limitations, we propose a mediator-guided multi-agent collaboration framework. In this framework, a large language model (LLM) serves as a dynamic mediator agent that orchestrates multiple open-source general-purpose and medical-domain-specific VLMs. Through output exchange, collaborative reflection, and consensus generation, the system enables self-reflective, multimodal reasoning without fine-tuning. To our knowledge, this is the first work to construct an interpretable and robust multimodal clinical decision support system over heterogeneous open-source VLMs. Evaluated on five medical visual question answering benchmarks, our approach significantly outperforms individual VLMs and proprietary GPT-series models, achieving state-of-the-art accuracy and robustness.

📝 Abstract
Complex medical decision-making involves cooperative workflows operated by different clinicians. Designing AI multi-agent systems can expedite and augment human-level clinical decision-making. Existing multi-agent research primarily focuses on language-only tasks, and its extension to multimodal scenarios remains challenging. A blind combination of diverse vision-language models (VLMs) can amplify erroneous outcome interpretations. Compared to large language models (LLMs) of comparable size, VLMs are generally less capable at instruction following and, importantly, at self-reflection. This disparity largely constrains VLMs' ability to participate in cooperative workflows. In this study, we propose MedOrch, a mediator-guided multi-agent collaboration framework for medical multimodal decision-making. MedOrch employs an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs toward collaboration. We use multiple open-source general-purpose and domain-specific VLMs instead of costly GPT-series models, revealing the strength of heterogeneous models. We show that collaboration among distinct VLM-based agents can surpass the capabilities of any individual agent. We validate our approach on five medical visual question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator-guided multi-agent collaboration in advancing medical multimodal intelligence. Our code will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Extending multi-agent systems to multimodal medical decision-making scenarios
Addressing vision-language models' limitations in instruction following and self-reflection
Enhancing collaboration among heterogeneous models without costly GPT-series models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based mediator guides VLM expert agents
Heterogeneous open-source models replace costly GPT
Multi-agent collaboration without model training
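The exchange-reflect-consensus loop described above can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the paper's implementation: the `ExpertAgent` class, its `answer`/`reflect` methods, and the majority-vote reflection rule are all illustrative assumptions. In MedOrch the experts are VLMs prompted with medical images and the mediator is an LLM; here both are stubbed to show the control flow only.

```python
from collections import Counter

class ExpertAgent:
    """Hypothetical stand-in for a VLM expert agent (illustrative, not MedOrch's API)."""

    def __init__(self, name, initial_answer):
        self.name = name
        self.current = initial_answer

    def answer(self, question):
        # A real VLM agent would reason over the question and image here.
        return self.current

    def reflect(self, question, peer_answers):
        # Assumed reflection rule: adopt the peers' answer when a strict
        # majority disagrees with this agent; a real VLM would re-reason.
        majority, count = Counter(peer_answers).most_common(1)[0]
        if count > len(peer_answers) / 2 and self.current != majority:
            self.current = majority
        return self.current

def mediator_round(question, agents, max_rounds=3):
    """LLM-mediator stand-in: exchange outputs, trigger reflection, stop at consensus."""
    answers = {a.name: a.answer(question) for a in agents}
    for _ in range(max_rounds):
        if len(set(answers.values())) == 1:
            break  # consensus reached, no further reflection needed
        for agent in agents:
            peers = [v for k, v in answers.items() if k != agent.name]
            answers[agent.name] = agent.reflect(question, peers)
    # Mediator's final consensus: majority vote over the last round's answers
    return Counter(answers.values()).most_common(1)[0][0]

# Usage: two agents agree, one dissents; the dissenter reflects and consensus forms.
agents = [
    ExpertAgent("general_vlm", "pneumonia"),
    ExpertAgent("medical_vlm", "pneumonia"),
    ExpertAgent("small_vlm", "normal"),
]
print(mediator_round("What does the chest X-ray show?", agents))  # → pneumonia
```

The sketch captures only the coordination pattern: the mediator circulates peer outputs, each agent may revise its answer, and iteration stops at consensus or a round limit.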
Kaitao Chen, Fudan University
Mianxin Liu, Shanghai AI Laboratory
Daoming Zong, East China Normal University
Chaoyue Ding, Fudan University
Shaohao Rui, PhD student, SJTU & SHAI Lab & SII (World Models, Video Gen, VLM, LLM)
Yankai Jiang, Shanghai AI Laboratory
Mu Zhou, Rutgers University
Xiaosong Wang, Shanghai AI Laboratory (Medical Image Analysis, Computer Vision, Vision and Language)