SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
The in-context learning (ICL) capabilities of medical multimodal large language models (MLLMs) under few-shot settings remain poorly understood, and no expert-curated benchmark exists for systematic evaluation. Method: We introduce SMMILE, the first expert-designed multimodal medical ICL benchmark, co-developed by 11 medical specialists and covering six clinical specialties and 13 imaging modalities, with 111 problems comprising 517 question-image-answer triplets. We further propose SMMILE++, an augmented variant with 1038 permuted problems, alongside systematic context-perturbation and example-ordering analyses. Contribution/Results: Experiments across 15 state-of-the-art MLLMs reveal critical limitations: strong recency bias, high sensitivity to noisy examples, and generally weak medical ICL performance, with ICL yielding only an 8.0% (SMMILE) to 9.4% (SMMILE++) average gain over zero-shot baselines. A single irrelevant demonstration degrades accuracy by up to 9.5%, whereas optimal example ordering improves performance by up to 71%. These findings underscore both the fragility and the untapped potential of medical ICL and establish SMMILE as a foundational resource for robust evaluation and advancement.
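The context-perturbation analysis described above can be illustrated with a minimal sketch: swap one in-context example for an irrelevant one and compare accuracy against the zero-shot baseline. All function and variable names here are hypothetical, not taken from the paper's code.

```python
import random

def perturb_with_irrelevant(demos, irrelevant_pool, rng=None):
    """Replace one randomly chosen in-context example with an irrelevant one.

    `demos` and `irrelevant_pool` are lists of (image, question, answer)
    tuples; this mimics the paper's noise-perturbation setup in spirit only.
    """
    rng = rng or random.Random(0)       # fixed seed for reproducibility
    perturbed = list(demos)             # copy so the original set is untouched
    idx = rng.randrange(len(perturbed))
    perturbed[idx] = rng.choice(irrelevant_pool)
    return perturbed

def icl_gain(acc_few_shot, acc_zero_shot):
    """Absolute improvement of few-shot ICL over the zero-shot baseline."""
    return acc_few_shot - acc_zero_shot
```

In this framing, the paper's headline numbers correspond to `icl_gain` values of roughly 0.08 to 0.094, while evaluating a model on `perturb_with_irrelevant(...)` outputs would surface the up-to-9.5% accuracy drop from a single noisy example.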

📝 Abstract
Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility to irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, example ordering exhibits a recency bias, i.e., placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal in-context learning ability in medical tasks
Assessing impact of irrelevant examples on model performance
Analyzing recency bias in example ordering for medical MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-driven multimodal ICL benchmark for medicine
Comprehensive evaluation of 15 MLLMs
Identifies critical limitations in medical ICL
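The recency-bias finding, that placing the most relevant demonstration last can improve performance by up to 71%, suggests a simple ordering heuristic. A minimal sketch, assuming a hypothetical per-example relevance score (the paper does not publish this exact interface):

```python
from dataclasses import dataclass

@dataclass
class Demo:
    """One multimodal in-context example (names are illustrative)."""
    image: str
    question: str
    answer: str
    relevance: float  # hypothetical relevance score w.r.t. the query

def order_for_recency(demos):
    """Sort ascending by relevance so the most relevant demo appears last,
    exploiting the recency bias reported for current MLLMs."""
    return sorted(demos, key=lambda d: d.relevance)

def build_icl_prompt(demos, query_question):
    """Assemble an interleaved image-text few-shot prompt (sketch only;
    real MLLM APIs take images as separate inputs, not inline tags)."""
    parts = [f"<image:{d.image}>\nQ: {d.question}\nA: {d.answer}"
             for d in order_for_recency(demos)]
    parts.append(f"Q: {query_question}\nA:")
    return "\n\n".join(parts)
```

The same machinery, run over all permutations of `demos`, would reproduce the kind of ordering analysis the benchmark uses to quantify recency effects.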