Medical thinking with multiple images

📅 2026-04-14

🤖 AI Summary
This work addresses a limitation of existing medical visual question answering (VQA) models: they predominantly rely on single-image inputs and so cannot support the multi-view image reasoning required in clinical practice. To bridge this gap, the authors introduce MedThinkVQA, a high-density multi-image medical VQA benchmark that averages 6.62 images per case and includes expert annotations, intermediate supervision, and step-level evaluation protocols. Through systematic evaluation of leading large models and error attribution analysis, they find that the best closed-source model (Claude-4.6-Opus) reaches only 57.2% accuracy and the best open-source model (Qwen3.5-397B-A17B) only 52.2%. Over 70% of errors stem from misreading individual images or failing to integrate evidence across views, which underscores the critical role of reliable visual grounding in multi-image diagnostic reasoning.

📝 Abstract
Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by Qwen3.5-397B-A17B at 52.2% and Qwen3.5-27B at 50.6%. Further analysis identifies grounded multi-image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher-level inference can help. Providing expert single-image cues and cross-image summaries improves performance, whereas replacing them with self-generated intermediates reduces accuracy. Step-level analysis shows that over 70% of errors arise from image reading and cross-view integration. Scaling results further show that additional inference-time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world multimodal clinical inputs.
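
The step-level error analysis described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the step names, the `CaseResult` record, and the rule of attributing each wrong answer to the first failed step are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical step names; the benchmark's actual step taxonomy may differ.
STEPS = ("image_reading", "cross_view_integration", "diagnostic_inference")


@dataclass
class CaseResult:
    """Outcome for one case (an assumed record layout, not the dataset schema)."""
    answer_correct: bool
    step_ok: dict  # step name -> True if that step was judged correct


def accuracy(results):
    """Fraction of cases whose final answer is correct."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.answer_correct for r in results) / len(results)


def attribute_errors(results):
    """Share of wrong answers attributed to the earliest failed step."""
    counts = Counter()
    for r in results:
        if r.answer_correct:
            continue
        # Assumption: blame the first step in pipeline order that failed.
        first_failed = next(
            (s for s in STEPS if not r.step_ok.get(s, True)), "unattributed"
        )
        counts[first_failed] += 1
    total = sum(counts.values())
    return {step: n / total for step, n in counts.items()} if total else {}


# Toy data, invented purely to exercise the functions.
results = [
    CaseResult(True, {s: True for s in STEPS}),
    CaseResult(False, {"image_reading": False, "cross_view_integration": False,
                       "diagnostic_inference": False}),
    CaseResult(False, {"image_reading": True, "cross_view_integration": False,
                       "diagnostic_inference": False}),
    CaseResult(False, {"image_reading": True, "cross_view_integration": True,
                       "diagnostic_inference": False}),
]

print(accuracy(results))
print(attribute_errors(results))
```

Attributing an error to the earliest failed step reflects the abstract's point that downstream reasoning cannot recover from misread images; other attribution rules (for example, counting every failed step) would give different shares.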
Problem

Research questions and friction points this paper is trying to address.

multi-image reasoning
medical visual question answering
clinical evidence integration
visual grounding
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-image reasoning
visual grounding
clinical VQA
step-level evaluation
evidence composition
Authors
Zonghai Yao
UMass Amherst
Medical-LLM, Multi-agent AI Hospital, Clinical Reasoning, Synthetic Data, Patient Education
Benlu Wang
Department of Computer Science, Yale University
Yifan Zhang
Assistant Professor, Computer Science, Missouri State University
Deep Learning, Time Series Forecasting
Junda Wang
University of Massachusetts Amherst
Natural Language Processing, Causal Inference, Healthcare
Iris Xia
Department of Computer Science, Yale University
Zhipeng Tang
UMass Amherst
Shuo Han
Miner School of Computer and Information Sciences, UMass Lowell
Feiyun Ouyang
Postdoc, UMass Lowell
Public Health, Computer Science, NLP, Epidemiology
Zhichao Yang
Manning College of Information and Computer Sciences, UMass Amherst
Arman Cohan
Yale University; Allen Institute for AI
Natural Language Processing, Machine Learning, Artificial Intelligence
Hong Yu
Manning College of Information and Computer Sciences, UMass Amherst; Center for Healthcare Organization and Implementation Research, VA Bedford Health Care; Miner School of Computer and Information Sciences, UMass Lowell