Expanding Relevance Judgments for Medical Case-based Retrieval Task with Multimodal LLMs

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In medical case-based retrieval, manual relevance judgments (qrels) are costly and sparse, especially in multimodal settings that require joint analysis of text and imaging. This paper explores using a multimodal large language model (Gemini 1.5 Pro) to scale relevance annotation on the ImageCLEFmed 2013 case-based retrieval task. We propose an iteratively refined, structured prompting strategy that integrates binary relevance scoring, instruction-based evaluation, and few-shot learning to automate cross-modal text–image relevance assessment. Across 35 topics, the method expands the original 15,028 human judgments to 558,653 (over 37x), raising the number of relevant annotations from 709 to 5,950. The automatically generated judgments achieve Cohen's Kappa of 0.6 against human judgments, indicating substantial agreement. This approach alleviates qrel sparsity and enables larger-scale, more efficient evaluation of medical retrieval systems.
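
As a concrete illustration of the prompting setup described above, here is a minimal sketch of a binary-relevance judging loop. It is not the paper's actual prompt or pipeline: the instruction wording, the few-shot examples, and the `call_mllm` wrapper around a multimodal LLM API (e.g., Gemini 1.5 Pro) are all illustrative assumptions.

```python
# Minimal sketch of a structured binary-relevance prompt for an MLLM.
# NOTE: `call_mllm` is a hypothetical wrapper around a multimodal LLM API;
# the instruction text and few-shot examples are illustrative assumptions,
# not the paper's actual configuration.
from dataclasses import dataclass

@dataclass
class CaseDoc:
    text: str
    image_paths: list[str]  # paths to the case's medical images

INSTRUCTIONS = (
    "You are assessing medical case-based retrieval. Given a query case "
    "and a candidate case (text plus images), answer with a single token: "
    "1 if the candidate is relevant to the query, 0 otherwise."
)

FEW_SHOT = [
    # (query summary, candidate summary, label) -- illustrative examples
    ("Chest CT, suspected pulmonary embolism", "PE on CT angiography", "1"),
    ("Chest CT, suspected pulmonary embolism", "Pediatric femur fracture", "0"),
]

def build_prompt(query: CaseDoc, candidate: CaseDoc) -> str:
    """Assemble instructions, few-shot examples, and the pair to judge."""
    shots = "\n".join(
        f"Query: {q}\nCandidate: {c}\nAnswer: {a}" for q, c, a in FEW_SHOT
    )
    return (
        f"{INSTRUCTIONS}\n\n{shots}\n\n"
        f"Query: {query.text}\nCandidate: {candidate.text}\nAnswer:"
    )

def judge_relevance(query: CaseDoc, candidate: CaseDoc, call_mllm) -> int:
    """Return a 1/0 judgment by sending text plus attached images to the MLLM."""
    reply = call_mllm(build_prompt(query, candidate),
                      images=query.image_paths + candidate.image_paths)
    return 1 if reply.strip().startswith("1") else 0
```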

📝 Abstract
Evaluating Information Retrieval (IR) systems relies on high-quality manual relevance judgments (qrels), which are costly and time-consuming to obtain. While pooling reduces the annotation effort, it results in only partially labeled datasets. Large Language Models (LLMs) offer a promising alternative for reducing reliance on manual judgments, particularly in complex domains like medical case-based retrieval, where relevance assessment requires analyzing both textual and visual information. In this work, we explore using a Multimodal Large Language Model (MLLM) to expand relevance judgments, creating a new dataset of automated judgments. Specifically, we employ Gemini 1.5 Pro on the ImageCLEFmed 2013 case-based retrieval task, simulating human assessment through an iteratively refined, structured prompting strategy that integrates binary scoring, instruction-based evaluation, and few-shot learning. We systematically experimented with various prompt configurations to maximize agreement with human judgments. To evaluate agreement between the MLLM and human judgments, we use Cohen's Kappa, achieving a substantial agreement score of 0.6, comparable to inter-annotator agreement typically observed in multimodal retrieval tasks. Starting from the original 15,028 manual judgments (4.72% relevant) across 35 topics, our MLLM-based approach expanded the dataset by over 37x to 558,653 judgments, increasing relevant annotations to 5,950. On average, each medical case query received 15,398 new annotations, with approximately 99% being non-relevant, reflecting the high sparsity typical in this domain. Our results demonstrate the potential of MLLMs to scale relevance judgment collection, offering a promising direction for supporting retrieval evaluation in medical and multimodal IR tasks.
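
Agreement is measured with Cohen's Kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the label marginals. A quick way to compute it over paired binary judgments (the label arrays below are synthetic placeholders, not the paper's data):

```python
# Cohen's Kappa between human and MLLM binary relevance labels.
# The label arrays are synthetic placeholders, not the paper's data.
from sklearn.metrics import cohen_kappa_score

human = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
mllm  = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]

kappa = cohen_kappa_score(human, mllm)
print(f"Cohen's Kappa: {kappa:.2f}")  # 0.58 for this toy data; ~0.6 is substantial
```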
Problem

Research questions and friction points this paper is trying to address.

Reduce reliance on costly manual relevance judgments in IR systems
Expand relevance judgments using Multimodal LLMs for medical case retrieval
Improve dataset coverage and annotation efficiency in multimodal IR tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLM expands medical relevance judgments
Structured prompting with binary scoring and few-shot learning
Scalable automated judgments achieve substantial human agreement (see the qrels sketch after this list)
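
One natural way to use the expanded judgments with standard IR evaluation tooling is to write them as TREC-format qrels (`topic 0 docid relevance` per line). The sketch below merges human and MLLM-generated judgments, letting human labels win on conflicts; the file names and that tie-break rule are illustrative assumptions, not specified by the paper.

```python
# Merge human and MLLM-generated qrels into one TREC-format file.
# Human judgments win on conflicts; file names are illustrative assumptions.

def read_qrels(path: str) -> dict[tuple[str, str], int]:
    """Parse 'topic 0 docid rel' lines into {(topic, docid): rel}."""
    qrels = {}
    with open(path) as f:
        for line in f:
            topic, _iteration, docid, rel = line.split()
            qrels[(topic, docid)] = int(rel)
    return qrels

human = read_qrels("qrels.human.txt")  # original manual judgments
auto = read_qrels("qrels.mllm.txt")    # MLLM-expanded judgments

merged = {**auto, **human}  # dict union: human labels overwrite auto on overlap
with open("qrels.merged.txt", "w") as f:
    for (topic, docid), rel in sorted(merged.items()):
        f.write(f"{topic} 0 {docid} {rel}\n")
```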
Catarina Pires
INESC TEC and Faculty of Engineering, University of Porto
Sérgio Nunes
INESC TEC and Faculty of Engineering, University of Porto, Portugal
Information Retrieval · Information Management · Information Systems · Web Technologies
Luís Filipe Teixeira
INESC TEC and Faculty of Engineering, University of Porto