3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

📅 2025-03-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) show a persistent gap between diagnostic accuracy and professional dialogue capability in telemedicine applications. Method: We introduce the first multimodal, multi-agent dialogue benchmark designed specifically for remote medical consultation. We propose a temperament-aware patient modeling approach, instantiate four behaviorally distinct patient agents and an automated evaluation agent to realistically simulate image-text dual-modality interactions, and introduce a collaborative reasoning paradigm that injects a CNN's top-3 diagnostic predictions into the LVLM's context. Results: The dialogue mechanism improves F1 by 3.8 points (50.4 to 54.2); adding image inputs yields a further 1.4-point gain; the CNN-LVLM collaborative method reaches an F1 score of 70.3, outperforming all baselines. This work contributes a novel benchmark, a collaborative reasoning paradigm, and practical methods for trustworthy LVLM-driven telemedicine.

📝 Abstract
Large Vision-Language Models (LVLMs) are increasingly being explored for applications in telemedicine, yet their ability to engage with diverse patient behaviors remains underexplored. We introduce 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source evaluation framework designed to assess LLM-driven medical consultations. Unlike existing benchmarks, 3MDBench simulates real-world patient variability by incorporating four temperament-driven Patient Agents and an Assessor Agent that evaluates diagnostic accuracy and dialogue quality. The benchmark integrates textual and image-based patient data across 34 common diagnoses, mirroring real-world telemedicine interactions. Under different diagnostic strategies, we evaluate state-of-the-art LVLMs. Our findings demonstrate that incorporating dialogue improves the F1 score from 50.4 to 54.2 compared to non-dialogue settings, underscoring the value of context-driven, information-seeking questioning. Additionally, we demonstrate that multimodal inputs enhance diagnostic efficiency. Image-supported models outperform text-only counterparts by raising the diagnostic F1 score from 52.8 to 54.2 in a similar dialogue setting. Finally, we suggest an approach that improves the diagnostic F1-score to 70.3 by training the CNN model on the diagnosis prediction task and incorporating its top-3 predictions into the LVLM context. 3MDBench provides a reproducible and extendable evaluation framework for AI-driven medical assistants. It offers insights into how patient temperament, dialogue strategies, and multimodal reasoning influence diagnosis quality. By addressing real-world complexities in telemedicine, our benchmark paves the way for more empathetic, reliable, and context-aware AI-driven healthcare solutions. The source code of our benchmark is publicly available: https://github.com/univanxx/3mdbench
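The abstract's strongest result comes from training a CNN on the diagnosis-prediction task and injecting its top-3 predictions into the LVLM's context. A minimal sketch of that injection step is below; the function names, prompt wording, and probability scores are illustrative stand-ins, not the benchmark's actual code:

```python
# Hypothetical sketch of the CNN-LVLM collaboration step described in
# the abstract: take a diagnosis classifier's class probabilities,
# keep the top-3, and prepend them to the consultation prompt as hints.
# All names and the prompt template here are assumptions for illustration.

def top_k_predictions(probs: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-probability (diagnosis, score) pairs."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

def build_consultation_prompt(complaint: str, probs: dict[str, float]) -> str:
    """Inject CNN hints into the context given to the dialogue LVLM."""
    hints = ", ".join(f"{dx} ({p:.2f})" for dx, p in top_k_predictions(probs))
    return (
        f"An image classifier suggests these candidate diagnoses: {hints}.\n"
        "Treat them only as hints; ask clarifying questions before concluding.\n"
        f"Patient: {complaint}"
    )

# Made-up scores over a few of the benchmark's 34 common diagnoses
scores = {"acne": 0.61, "rosacea": 0.22, "eczema": 0.09, "psoriasis": 0.05}
prompt = build_consultation_prompt("I have red bumps on my cheeks.", scores)
```

The key design point, per the abstract, is that the classifier's output augments rather than replaces the LVLM's reasoning: the model still conducts the information-seeking dialogue, which is what lifts F1 from the low 50s to 70.3.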
Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLMs' telemedicine diagnostic and dialogue abilities
Simulating patient variability for accurate medical consultations
Improving diagnostic accuracy with multimodal dialogue and reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal multi-agent dialogue framework
Temperament-based Patient Agents simulation
Diagnostic convolutional network integration
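The multi-agent setup pairs four temperament-driven Patient Agents with a doctor model and an Assessor Agent. A toy sketch of that dialogue loop follows; the classes, keyword-matching replies, and case facts are illustrative stand-ins (the real agents are LLM-driven), included only to show the interaction structure:

```python
# Minimal sketch (not the benchmark's implementation) of the multi-agent
# loop: a temperament-conditioned Patient Agent answers the doctor
# model's questions, producing a transcript for a downstream assessor.
# Agent classes, temperament labels' use, and replies are assumptions.

TEMPERAMENTS = ["sanguine", "choleric", "melancholic", "phlegmatic"]

class PatientAgent:
    def __init__(self, temperament: str, facts: dict[str, str]):
        assert temperament in TEMPERAMENTS
        self.temperament = temperament
        self.facts = facts  # ground-truth case details the agent may reveal

    def reply(self, question: str) -> str:
        # A real Patient Agent would be an LLM prompted with its
        # temperament; here we just look up case facts by keyword.
        for topic, answer in self.facts.items():
            if topic in question.lower():
                return answer
        return "I'm not sure."

def run_consultation(patient: PatientAgent, questions: list[str]) -> list[str]:
    """Collect the patient's answers as a transcript for assessment."""
    return [patient.reply(q) for q in questions]

patient = PatientAgent("melancholic", {
    "duration": "About two weeks.",
    "pain": "It itches more than it hurts.",
})
transcript = run_consultation(patient, [
    "How long is the duration of the symptoms?",
    "Do you feel any pain?",
])
```

In the benchmark itself, the Assessor Agent would then score this transcript on both diagnostic accuracy and dialogue quality, which is how the framework separates information-seeking skill from raw classification ability.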