🤖 AI Summary
This study addresses the tendency of large vision-language models (LVLMs) to produce plausible but unreliable chest X-ray diagnoses: responses that are not grounded in radiological evidence, offer little visual evidence for verification, and require costly retraining to cover new diagnostic tasks. The authors propose CXReasonAgent, a diagnostic agent that couples a large language model with clinically grounded diagnostic tools to perform multi-step, verifiable reasoning based on image-derived diagnostic and visual evidence. They also introduce CXReasonDial, a multi-turn diagnostic dialogue benchmark for chest X-rays comprising 1,946 dialogues across 12 diagnostic tasks. Evaluation on this benchmark shows that CXReasonAgent produces more faithfully grounded responses than LVLMs, making its diagnostic reasoning more reliable and verifiable.
📝 Abstract
Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification. They also require costly retraining to support new diagnostic tasks, which limits their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.
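The general pattern of pairing a language model with diagnostic tools can be illustrated with a minimal sketch. It is not the CXReasonAgent implementation, which the abstract does not detail. Every name here (`Evidence`, `cardiomegaly_tool`, `TOOLS`, `answer`, and the landmark keys) is hypothetical, the LLM's tool-selection step is replaced by a plain dictionary lookup, and the example tool uses the conventional cardiothoracic ratio threshold of 0.5 purely as an illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Evidence:
    """Structured output of a diagnostic tool (hypothetical schema)."""
    task: str
    measurements: Dict[str, float]
    rule: str
    finding: str


def cardiomegaly_tool(landmarks: Dict[str, float]) -> Evidence:
    """Illustrative tool: cardiothoracic ratio from image-derived landmarks.

    The landmark keys are assumptions; a real system would obtain them
    from segmentation or landmark detection on the image.
    """
    heart_width = landmarks["heart_right_x"] - landmarks["heart_left_x"]
    thorax_width = landmarks["thorax_right_x"] - landmarks["thorax_left_x"]
    ctr = heart_width / thorax_width
    finding = "cardiomegaly suspected" if ctr > 0.5 else "no cardiomegaly"
    return Evidence(
        task="cardiomegaly",
        measurements={"cardiothoracic_ratio": ctr},
        rule="cardiothoracic ratio > 0.5",
        finding=finding,
    )


# Tool registry: supporting a new task means registering a new function,
# not retraining a model.
TOOLS: Dict[str, Callable[[Dict[str, float]], Evidence]] = {
    "cardiomegaly": cardiomegaly_tool,
}


def answer(task: str, landmarks: Dict[str, float]) -> str:
    """Stand-in for the agent step: route to a tool, then cite its evidence.

    In an actual agent the LLM would choose the tool and phrase the reply;
    here both steps are hard-coded so the sketch stays runnable.
    """
    tool = TOOLS.get(task)
    if tool is None:
        return f"No tool is registered for '{task}', so no evidence is available."
    ev = tool(landmarks)
    measured = ", ".join(f"{k}={v:.2f}" for k, v in ev.measurements.items())
    return f"{ev.finding} (rule: {ev.rule}; measured: {measured})"


if __name__ == "__main__":
    demo = {
        "heart_left_x": 100.0, "heart_right_x": 260.0,
        "thorax_left_x": 50.0, "thorax_right_x": 350.0,
    }
    print(answer("cardiomegaly", demo))
```

The point of the design is that every statement in the reply can be traced to a measurement and a rule returned by a tool, and that an unsupported task yields an explicit refusal rather than an ungrounded guess.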