RadDiff: Describing Differences in Radiology Image Sets with Natural Language

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Accurately describing clinically meaningful differences between pairs of radiological images remains a key challenge for interpretable medical AI. This work proposes a multimodal agent system that emulates radiologists' comparative diagnostic workflow by integrating imaging data and clinical reports. The system combines medical knowledge injection, multimodal iterative reasoning, targeted visual search, and localized region magnification to generate precise natural-language descriptions of image discrepancies. Evaluated on RadDiffBench, a newly constructed expert-validated benchmark, the system achieves 47% accuracy (50% when guided by ground-truth reports), substantially outperforming general-purpose baselines. It further demonstrates practical utility in tasks such as comparing COVID-19 phenotypes, analyzing racial subgroup differences, and discovering survival-related imaging features.

📝 Abstract
Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
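The proposer-ranker loop the abstract describes (inherited from VisDiff, with iterative hypothesis refinement on top) can be sketched in miniature. In this sketch, `propose` and `similarity` are hypothetical stubs standing in for the VLM-based hypothesis proposer and a CLIP-style text-image scorer; none of these names are RadDiff's actual API, and the toy "images" are just dicts of textual findings:

```python
def propose(set_a, set_b, prior_hypotheses):
    """Propose candidate difference descriptions for two image sets.
    A real system would caption sampled pairs with a domain-adapted VLM;
    here we return a fixed pool of candidates not yet considered."""
    candidates = {
        "increased opacity in left lung",
        "pleural effusion present",
        "no visible difference",
    }
    return sorted(candidates - set(prior_hypotheses))

def similarity(image, hypothesis):
    """Score how well a hypothesis matches one image.
    Stub for a vision-language similarity model."""
    return float(hypothesis in image["findings"])

def rank(set_a, set_b, hypotheses):
    """Rank hypotheses by how well they separate set A from set B:
    mean similarity on A minus mean similarity on B."""
    scored = []
    for h in hypotheses:
        gap = (sum(similarity(x, h) for x in set_a) / len(set_a)
               - sum(similarity(x, h) for x in set_b) / len(set_b))
        scored.append((gap, h))
    return sorted(scored, reverse=True)

def describe_differences(set_a, set_b, rounds=2):
    """Iterative hypothesis refinement: propose new candidates each
    round, then keep the hypothesis that best separates the sets."""
    hypotheses = []
    for _ in range(rounds):
        hypotheses += propose(set_a, set_b, hypotheses)
    return rank(set_a, set_b, hypotheses)[0]

# Toy "image sets": each image is a dict of textual findings.
covid = [{"findings": {"increased opacity in left lung"}} for _ in range(4)]
control = [{"findings": set()} for _ in range(4)]

best_gap, best_hypothesis = describe_differences(covid, control)
print(best_hypothesis)  # "increased opacity in left lung"
```

RadDiff's innovations would slot into this skeleton: the proposer sees reports alongside images, and the ranker's scoring operates on zoomed crops of salient regions rather than whole images.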
Problem

Research questions and friction points this paper is trying to address.

radiology image comparison
clinical difference description
medical imaging analysis
natural language generation
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal reasoning
vision-language model
iterative hypothesis refinement
targeted visual search
radiology comparison