MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of multimodal large language models (MLLMs) in medical imaging rely predominantly on coarse-grained metrics, which fail to capture the fine-grained reasoning reliability required in clinical practice. To address this limitation, this work proposes MedRCube, a multidimensional, fine-grained evaluation paradigm built with a systematic two-stage construction pipeline encompassing task decomposition, fine-grained metric design, and trustworthiness quantification. The framework is used to assess 33 state-of-the-art MLLMs. MedRCube also introduces a dedicated trustworthiness evaluation subset, which reveals a significant positive correlation between model shortcut behaviors and diagnostic performance, a flaw that conventional evaluation methods do not detect. Experimental results identify Lingshu-32B as the top-performing model under MedRCube.

📝 Abstract

The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks that are aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained, and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, with Lingshu-32B achieving top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncovering a highly significant positive association between shortcut behavior and diagnostic task performance, which raises concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.
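
To illustrate why a single aggregate score can hide what a multidimensional breakdown exposes, here is a minimal Python sketch. It is not MedRCube's actual pipeline or data: the two dimensions (modality and task), the toy records, and the function names `overall_accuracy` and `per_cell_accuracy` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation records: (modality, task, is_correct).
# The dimension names and values are illustrative, not taken from MedRCube.
records = [
    ("CT", "diagnosis", True),
    ("CT", "diagnosis", True),
    ("CT", "localization", False),
    ("MRI", "diagnosis", True),
    ("MRI", "localization", False),
    ("MRI", "localization", True),
]


def overall_accuracy(records):
    """Coarse-grained view: one number over all samples."""
    return sum(ok for _, _, ok in records) / len(records)


def per_cell_accuracy(records):
    """Fine-grained view: accuracy for each (modality, task) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [num_correct, num_samples]
    for modality, task, ok in records:
        cell = totals[(modality, task)]
        cell[0] += int(ok)
        cell[1] += 1
    return {key: correct / n for key, (correct, n) in totals.items()}


overall = overall_accuracy(records)
cells = per_cell_accuracy(records)

print(f"overall: {overall:.2f}")
for (modality, task), acc in sorted(cells.items()):
    print(f"{modality:>3} / {task:<12} {acc:.2f}")
```

In this toy data the aggregate score looks moderate, while the per-cell view shows that every CT localization sample is wrong, which is the kind of gap a coarse-grained metric would not surface.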

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

Medical Imaging

Evaluation Framework

Fine-Grained Evaluation

Reasoning Credibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

multidimensional evaluation

fine-grained assessment

medical imaging MLLMs