RadDiff: Describing Differences in Radiology Image Sets with Natural Language

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Accurately describing clinically meaningful differences between pairs of radiological images remains a key challenge for interpretable medical AI. This work proposes a multimodal agent system that emulates radiologists' comparative diagnostic workflow by integrating imaging data and clinical reports. The system combines medical knowledge injection, multimodal iterative reasoning, targeted visual search, and localized region magnification to generate precise natural-language descriptions of image discrepancies. Evaluated on RadDiffBench, a newly constructed expert-validated benchmark, the system achieves 47% accuracy (50% when guided by ground-truth reports), substantially outperforming general-purpose baselines. It further demonstrates practical utility in tasks such as comparing COVID-19 phenotypes, analyzing racial subgroup differences, and discovering survival-related imaging features.

📝 Abstract
Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
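The proposer-ranker loop the abstract describes (inherited from VisDiff, with iterative hypothesis refinement on top) can be sketched in miniature. In this sketch, `propose` and `similarity` are hypothetical stubs standing in for the VLM-based hypothesis proposer and a CLIP-style text-image scorer; none of these names are RadDiff's actual API, and the toy "images" are just dicts of textual findings:

```python
def propose(set_a, set_b, prior_hypotheses):
    """Propose candidate difference descriptions for two image sets.
    A real system would caption sampled pairs with a domain-adapted VLM;
    here we return a fixed pool of candidates not yet considered."""
    candidates = {
        "increased opacity in left lung",
        "pleural effusion present",
        "no visible difference",
    }
    return sorted(candidates - set(prior_hypotheses))

def similarity(image, hypothesis):
    """Score how well a hypothesis matches one image.
    Stub for a vision-language similarity model."""
    return float(hypothesis in image["findings"])

def rank(set_a, set_b, hypotheses):
    """Rank hypotheses by how well they separate set A from set B:
    mean similarity on A minus mean similarity on B."""
    scored = []
    for h in hypotheses:
        gap = (sum(similarity(x, h) for x in set_a) / len(set_a)
               - sum(similarity(x, h) for x in set_b) / len(set_b))
        scored.append((gap, h))
    return sorted(scored, reverse=True)

def describe_differences(set_a, set_b, rounds=2):
    """Iterative hypothesis refinement: propose new candidates each
    round, then keep the hypothesis that best separates the sets."""
    hypotheses = []
    for _ in range(rounds):
        hypotheses += propose(set_a, set_b, hypotheses)
    return rank(set_a, set_b, hypotheses)[0]

# Toy "image sets": each image is a dict of textual findings.
covid = [{"findings": {"increased opacity in left lung"}} for _ in range(4)]
control = [{"findings": set()} for _ in range(4)]

best_gap, best_hypothesis = describe_differences(covid, control)
print(best_hypothesis)  # "increased opacity in left lung"
```

RadDiff's innovations would slot into this skeleton: the proposer sees reports alongside images, and the ranker's scoring operates on zoomed crops of salient regions rather than whole images.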
Problem

Research questions and friction points this paper is trying to address.

radiology image comparison
clinical difference description
medical imaging analysis
natural language generation
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal reasoning
vision-language model
iterative hypothesis refinement
targeted visual search
radiology comparison