Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

📅 2026-04-09
🤖 AI Summary
This study addresses the underperformance of medical multimodal large language models (MLLMs) in image classification, despite their larger parameter counts and more extensive pretraining data than conventional deep learning models. Using feature probing, the authors systematically trace visual information flow at both the module and layer level across 14 open-source medical MLLMs on three representative classification benchmarks. The analysis uncovers four failure modes, corresponding to four bottlenecks that limit performance: visual representation quality, connector fidelity, language model comprehension, and semantic mapping alignment. It also introduces a quantitative "feature evolution health" score that enables principled model diagnostics, offering foundational insights toward improving the clinical applicability of medical MLLMs.
📝 Abstract
The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
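The layer-wise feature probing described above can be sketched with a simple linear probe: freeze the features extracted at each stage of the pipeline, fit a lightweight classifier on them, and watch how probe accuracy evolves from stage to stage. The synthetic features, ridge-style probe, and decreasing-separability loop below are illustrative assumptions, not the paper's actual implementation or its "feature evolution health" metric.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe_accuracy(feats, labels, reg=1e-3):
    """Fit a ridge-regularized linear classifier on the first half of the
    frozen features and return held-out accuracy on the second half."""
    n, d = feats.shape
    half = n // 2
    X = np.hstack([feats, np.ones((n, 1))])   # add bias column
    Y = np.eye(labels.max() + 1)[labels]      # one-hot targets
    W = np.linalg.solve(X[:half].T @ X[:half] + reg * np.eye(d + 1),
                        X[:half].T @ Y[:half])
    preds = (X[half:] @ W).argmax(axis=1)
    return (preds == labels[half:]).mean()

# Simulate features "extracted" after successive modules/layers of an MLLM;
# shrinking class separation mimics a classification signal being diluted.
n, d, n_classes = 400, 32, 3
labels = rng.integers(0, n_classes, n)
centers = rng.normal(size=(n_classes, d))
scores = []
for sep in [3.0, 2.0, 1.0, 0.3]:              # decreasing separability
    feats = sep * centers[labels] + rng.normal(size=(n, d))
    scores.append(linear_probe_accuracy(feats, labels))
print([round(s, 2) for s in scores])
```

In this toy run, probe accuracy falls as the simulated features lose class structure, which is the kind of stage-by-stage signature the paper uses to localize where classification signals are distorted or diluted.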
Problem

Research questions and friction points this paper is trying to address.

medical multimodal large language models
image classification
performance degradation
feature representation
semantic misalignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

feature probing
multimodal large language models
medical image classification
performance degradation
failure modes