TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical vision-language model benchmarks predominantly focus on single-visit image analysis, overlooking the clinical necessity of longitudinal reasoning grounded in temporal patient histories. To address this gap, the authors introduce TemMed-Bench, the first benchmark for cross-visit temporal medical image reasoning, comprising three tasks: visual question answering (VQA), report generation, and image-pair selection, together with a supplementary knowledge corpus of over 17,000 instances. It supports both closed-book and open-book evaluation and incorporates a multi-modal retrieval-augmented inference setting. Experiments on six proprietary and six open-source LVLMs reveal that most models perform near randomly on temporal reasoning, while GPT o3, o4-mini, and Claude 3.5 Sonnet achieve comparatively stronger, though still unsatisfactory, performance. Multi-modal retrieval augmentation yields an average VQA accuracy gain of 2.59%. TemMed-Bench fills a critical void in evaluating temporal medical vision-language reasoning and advances model development toward clinically realistic longitudinal decision-making.

📝 Abstract
Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in a patient's condition between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks (visual question answering (VQA), report generation, and image-pair selection) and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we conduct an evaluation of six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini, and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. We show that multi-modal retrieval augmentation yields notably higher performance gains than either no retrieval or textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded in real-world clinical practice; it reveals LVLMs' limitations in temporal medical image reasoning and highlights multi-modal retrieval augmentation as a promising direction for addressing this challenge.
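The paper does not publish its retrieval pipeline here, but the open-book setting it describes (retrieving similar prior cases by joint image-text similarity and prepending their reports to the model input) can be illustrated with a minimal sketch. All names below (`CorpusEntry`, `retrieve`, `build_prompt`) and the toy embeddings are hypothetical; a real system would use a trained multi-modal encoder to produce the vectors.

```python
import math
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    # Hypothetical joint image+text embedding for one prior case,
    # as would be produced by a multi-modal encoder.
    embedding: list[float]
    report: str

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb: list[float], corpus: list[CorpusEntry], k: int = 2) -> list[CorpusEntry]:
    """Return the k corpus cases most similar to the query embedding."""
    return sorted(corpus, key=lambda e: cosine(query_emb, e.embedding), reverse=True)[:k]

def build_prompt(question: str, retrieved: list[CorpusEntry]) -> str:
    """Prepend retrieved prior-case reports to the question for the LVLM."""
    context = "\n".join(f"- {e.report}" for e in retrieved)
    return f"Similar prior cases:\n{context}\n\nQuestion: {question}"

# Toy corpus standing in for the 17,000-instance knowledge base.
corpus = [
    CorpusEntry([1.0, 0.0], "Pleural effusion decreased between visits."),
    CorpusEntry([0.0, 1.0], "No interval change in cardiac silhouette."),
    CorpusEntry([0.9, 0.1], "Effusion partially resolved on follow-up."),
]
top = retrieve([1.0, 0.0], corpus, k=2)
prompt = build_prompt("Has the effusion improved since the prior visit?", top)
```

The design choice being illustrated: retrieval keys on the query embedding (standing in for the current image pair), while the text injected into the prompt is the retrieved report, which is the "both visual and textual modalities" augmentation the abstract contrasts with text-only retrieval.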
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-language models' ability to analyze medical image changes over time
Assessing temporal reasoning capabilities using multi-visit clinical image comparisons
Benchmarking performance on medical condition tracking across different patient visits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TemMed-Bench for temporal medical image analysis
Evaluates LVLMs on multi-visit medical image reasoning tasks
Uses multi-modal retrieval augmentation to improve model performance