MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current evaluations of multimodal large language models (MLLMs) in medical imaging predominantly rely on single or coarse-grained metrics, which fail to capture the fine-grained reasoning reliability required in clinical practice. To address this limitation, this work proposes MedRCube, a multidimensional, fine-grained evaluation framework that employs a systematic two-stage construction pipeline encompassing task decomposition, fine-grained metric design, and trustworthiness quantification. The framework is used to benchmark 33 state-of-the-art MLLMs, among which Lingshu-32B achieves top-tier performance. MedRCube also introduces a dedicated credibility evaluation subset, which reveals a highly significant positive association between model shortcut behavior and diagnostic task performance, a flaw that conventional evaluation settings do not expose and that raises concerns for clinically trustworthy deployment.

📝 Abstract
The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks that are aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained, and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, among which Lingshu-32B achieves top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility and uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Medical Imaging
Evaluation Framework
Fine-Grained Evaluation
Reasoning Credibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

multidimensional evaluation
fine-grained assessment
medical imaging MLLMs
reasoning credibility
shortcut behavior