MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text

📅 2025-08-13
🤖 AI Summary
Problem: Existing medical multimodal benchmarks predominantly support single-image, single-turn tasks and therefore fail to reflect the clinical need for joint reasoning over multiple imaging modalities (e.g., CT, MRI, PET) and longitudinal textual patient histories. Method: We introduce MedAtlas, a high-fidelity multimodal medical reasoning benchmark that supports multi-turn dialogue, multi-image joint analysis, and comprehensive diagnostic reasoning. Each case combines medical images with temporally ordered clinical text and expert-annotated gold standards, so that the evaluation emulates real-world clinical workflows. Contribution/Results: We propose two new metrics, Round Chain Accuracy and Error Propagation Resistance, for assessing large models across multi-stage, multi-task, multimodal clinical reasoning. Evaluation of existing multimodal models reveals substantial performance gaps in multi-stage clinical reasoning, and the benchmark provides a challenging platform for developing robust and trustworthy medical AI.

📝 Abstract
Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models capable of adapting to diverse real-world scenarios and performing complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lacking multi-modal medical image integration and failing to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-turn question answering, closed-ended multi-turn question answering, multi-image joint reasoning, and comprehensive disease diagnosis. Each case is derived from real diagnostic workflows and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray, requiring models to perform deep integrative reasoning across images and clinical texts. MedAtlas provides expert-annotated gold standards for all tasks. Furthermore, we propose two novel evaluation metrics: Round Chain Accuracy and Error Propagation Resistance. Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. MedAtlas establishes a challenging evaluation platform to advance the development of robust and trustworthy medical AI.
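The abstract names Round Chain Accuracy and Error Propagation Resistance but does not define them. The sketch below is only one plausible reading, not the authors' formulation. It assumes that each case is a list of per-round correctness flags, that chain accuracy is the fraction of rounds answered correctly before the first error, and that propagation resistance is the share of rounds answered correctly immediately after an incorrect round. The function names and both definitions are hypothetical.

```python
from typing import List, Optional


def chain_accuracy(rounds: List[bool]) -> float:
    """Hypothetical per-case score: length of the leading run of correct
    rounds divided by the total number of rounds in the case."""
    if not rounds:
        raise ValueError("a case needs at least one round")
    prefix = 0
    for ok in rounds:
        if not ok:
            break
        prefix += 1
    return prefix / len(rounds)


def round_chain_accuracy(cases: List[List[bool]]) -> float:
    """Assumed aggregate: mean of the per-case chain scores."""
    if not cases:
        raise ValueError("need at least one case")
    return sum(chain_accuracy(c) for c in cases) / len(cases)


def error_propagation_resistance(cases: List[List[bool]]) -> Optional[float]:
    """Assumed definition: among rounds that directly follow an incorrect
    round, the fraction answered correctly. Returns None when no round
    follows an error, since the ratio is undefined in that situation."""
    after_error = 0
    recovered = 0
    for rounds in cases:
        for prev_ok, cur_ok in zip(rounds, rounds[1:]):
            if not prev_ok:
                after_error += 1
                recovered += int(cur_ok)
    return recovered / after_error if after_error else None


# Illustrative flags only; these are not results from the paper.
example_cases = [[True, True, False, True], [False, True]]
print(round_chain_accuracy(example_cases))
print(error_propagation_resistance(example_cases))
```

Under this reading, a model that answers early rounds well but never recovers after a mistake would score reasonably on the first metric and poorly on the second, which is the kind of distinction a multi-round benchmark is meant to expose.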
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for multi-round medical reasoning tasks
Assessing multi-modal image and clinical text integration
Measuring performance gaps in multi-stage clinical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn dialogue for medical reasoning
Multi-modal medical image interaction
Novel evaluation metrics for clinical AI
Ronghao Xu
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China
Zhen Huang
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, P.R. China
Yangbo Wei
School of Information Science and Technology, Eastern Institute of Technology, Ningbo 315200, P.R. China
Xiaoqian Zhou
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China
Zikang Xu
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Algorithm Fairness, Medical Image Analysis, MEG data analysis
Ting Liu
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China
Zihang Jiang
School of Biomedical Engineering, USTC, Suzhou Institute for Advanced Research
Computer Vision, Medical Imaging, 3D
S. Kevin Zhou
School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230026, P.R. China