🤖 AI Summary
This study addresses chest X-ray (CXR) visual question answering (VQA) by unifying two clinically critical tasks: single-image abnormality detection and dual-temporal image difference analysis. We propose a two-stage, report-driven framework: first, a multimodal Transformer generates radiology reports from input images; second, these self-generated reports are dynamically injected as interpretable evidence into the answer generation module, enabling report-grounded, Chain-of-Thought-style reasoning. To our knowledge, this is the first unified VQA architecture supporting both single-image and differential-image tasks, and the first to leverage self-generated radiology reports as an explicit, explainable intermediate representation for VQA inference. Evaluated on the Medical-Diff-VQA benchmark, our method achieves state-of-the-art performance, significantly improving accuracy and clinical interpretability across both tasks.
📄 Abstract
We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling it to detect and describe temporal changes. Taking inspiration from Chain-of-Thought reasoning, we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence for the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
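The two-step pipeline described above (Report Generation, then report-grounded Answer Generation) can be sketched as plain control flow. This is a minimal illustrative sketch only: all names (`VQAExample`, `generate_report`, `generate_answer`, `answer_vqa`) are hypothetical placeholders, and the model calls are stubbed rather than the authors' actual Transformer implementation.

```python
from dataclasses import dataclass

@dataclass
class VQAExample:
    # One CXR for single-image questions, two longitudinal CXRs
    # (same patient, different time points) for difference questions.
    images: list
    question: str

def generate_report(images):
    # Step i) Report Generation (stubbed): in the paper this is a
    # multimodal Transformer that predicts a free-text radiology
    # report for each input CXR.
    return [f"<predicted report for image {i}>" for i in range(len(images))]

def generate_answer(images, question, reports):
    # Step ii) Answer Generation (stubbed): the answer is decoded
    # auto-regressively, conditioned on the image features AND the
    # self-generated report(s), which serve as explicit, interpretable
    # evidence (Chain-of-Thought-style grounding).
    evidence = " ".join(reports)
    return f"answer grounded in [{evidence}] for question: {question}"

def answer_vqa(example: VQAExample) -> str:
    reports = generate_report(example.images)   # step i: RG
    return generate_answer(example.images, example.question, reports)  # step ii: AG
```

The same entry point serves both question types: the branch between single-image and image-difference behaviour is carried entirely by how many images (and hence reports) are passed through the pipeline.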