🤖 AI Summary
Chain-of-thought (CoT) reasoning—a key capability in large language models—has not been systematically evaluated for its impact on the performance and reliability of multimodal large language models (MLLMs) in clinical tasks. Method: We conduct the first dual-state (CoT-enabled vs. CoT-disabled) comparative evaluation of two state-of-the-art MLLMs—Seed1.5-VL and Gemini-2.5-Flash—on medical visual question answering and image interpretation using the VQA-RAD and ROCOv2 benchmarks. Contribution/Results: CoT yields only marginal improvements across most clinical scenarios, revealing fundamental limitations of generic reasoning mechanisms in domain-specific semantic understanding, anatomical logical inference, and fine-grained discrimination. We propose a synergistic optimization framework integrating domain-knowledge enhancement with structured prompting, offering empirical grounding and methodological insights for developing trustworthy, clinically grounded reasoning paradigms in medical MLLMs.
📝 Abstract
A recent advancement in Multimodal Large Language Model (MLLM) research is the emergence of "reasoning MLLMs" that offer explicit control over their internal thinking processes (commonly referred to as the "thinking mode") alongside the standard "non-thinking mode". This capability allows these models to engage in a step-by-step process of internal deliberation before generating a final response. Given the rapid transition to and adoption of these "dual-state" MLLMs, this work rigorously evaluates how their enhanced reasoning processes affect model performance and reliability in clinical tasks. This paper evaluates the active "thinking mode" capabilities of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We assess their performance on four visual medical tasks using the VQA-RAD and ROCOv2 datasets. Our findings reveal that, for the majority of tasks, the improvement from activating the thinking mode remains marginal compared to the standard non-thinking mode. Performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal, highlighting the need for domain-specific medical data and more advanced methods for medical knowledge integration.
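The dual-state protocol described above amounts to running the same benchmark items through a model twice, once per mode, and comparing accuracy. A minimal sketch of that loop follows; `query_mllm` is a hypothetical stub standing in for a real Seed1.5-VL or Gemini-2.5-Flash API call, and the toy items stand in for actual VQA-RAD / ROCOv2 examples.

```python
# Sketch of a dual-state (thinking vs. non-thinking) evaluation loop.
# `query_mllm` is a hypothetical stub; a real harness would call the
# model's API with the thinking mode toggled and pass the actual image.

def query_mllm(image_id: str, question: str, thinking: bool) -> str:
    """Hypothetical model call; returns a canned answer for illustration."""
    canned = {"img1": "yes", "img2": "no"}
    return canned[image_id]

def evaluate(dataset, thinking: bool) -> float:
    """Exact-match accuracy over (image_id, question, gold_answer) triples
    for a single mode (thinking on or off)."""
    correct = sum(
        query_mllm(img, q, thinking).strip().lower() == gold
        for img, q, gold in dataset
    )
    return correct / len(dataset)

# Toy stand-ins for closed-ended VQA-RAD items.
toy_set = [
    ("img1", "Is there a pleural effusion?", "yes"),
    ("img2", "Is the cardiac silhouette enlarged?", "no"),
]

acc_thinking = evaluate(toy_set, thinking=True)
acc_standard = evaluate(toy_set, thinking=False)
print(f"thinking: {acc_thinking:.2f}  non-thinking: {acc_standard:.2f}")
```

Comparing the two accuracy figures per task is what yields the "marginal improvement" finding reported in the abstract; a full harness would additionally log the emitted reasoning traces for qualitative analysis.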