ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
A critical global shortage of radiologists exacerbates the imbalance between demand and supply in medical image diagnosis. To address this, the paper proposes a vision-language framework that integrates the EVA Vision Transformer with the Llama 2 large language model (LLM) for end-to-end disease classification and lesion localization in chest X-ray images. Using task-specific prompt engineering and transfer learning, the method encodes visual features into tokens that are fed, together with the prompt, into the LLM, which jointly produces interpretable diagnostic findings and bounding boxes for pathological regions. Evaluated on the VinDr-CXR dataset, the framework achieves an F1 score of 0.76 on global disease classification, supporting both diagnostic efficiency and interpretability. The work demonstrates a practical approach to LLM-augmented medical imaging analysis, combining clinical applicability with methodological novelty.

📝 Abstract
The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.
Problem

Research questions and friction points this paper is trying to address.

Detect and localize diseases in chest X-rays using AI
Combine vision transformers and large language models for medical imaging
Improve radiologists' efficiency with automated preliminary findings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates EVA ViT with Llama 2 LLM
Converts X-ray images into tokens
Uses transfer learning for better performance
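The pipeline described above (image encoder → visual tokens → engineered prompt → LLM → labels and bounding boxes) can be sketched roughly as follows. This is a minimal illustrative mock-up, not the authors' implementation: `encode_image`, `run_llm`, the prompt wording, and the `Label (x1, y1, x2, y2)` output format are all assumptions introduced for the example.

```python
# Hypothetical sketch of the ChestGPT-style pipeline: visual tokens plus an
# engineered prompt go to an LLM, and the LLM's text output is parsed into
# disease labels with bounding boxes. All function names and formats here
# are illustrative stand-ins, not the paper's actual code.
import re
from typing import List, Tuple

def encode_image(image_path: str) -> List[float]:
    """Stand-in for the EVA ViT encoder: returns a dummy visual embedding."""
    return [0.0] * 8  # real system: a sequence of per-patch token embeddings

def build_prompt(task: str) -> str:
    """Task-specific prompt engineering, as the paper outlines for
    different radiologist scenarios (prompt text is invented here)."""
    prompts = {
        "classify": "List the thoracic diseases visible in this chest X-ray.",
        "localize": "For each finding, give a bounding box as (x1, y1, x2, y2).",
    }
    return prompts[task]

def run_llm(visual_tokens: List[float], prompt: str) -> str:
    """Stand-in for Llama 2: returns a canned diagnostic response."""
    return ("Findings: Cardiomegaly (120, 340, 560, 700); "
            "Pleural effusion (80, 600, 300, 820)")

def parse_findings(text: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Parse 'Label (x1, y1, x2, y2)' pairs out of the LLM output."""
    pattern = r"([A-Za-z ]+?)\s*\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)"
    out = []
    for m in re.finditer(pattern, text):
        label = m.group(1).strip()
        box = tuple(int(m.group(g)) for g in range(2, 6))
        out.append((label, box))
    return out

tokens = encode_image("chest_xray.png")
response = run_llm(tokens, build_prompt("localize"))
findings = parse_findings(response)
# findings → [('Cardiomegaly', (120, 340, 560, 700)),
#             ('Pleural effusion', (80, 600, 300, 820))]
```

In the real system the visual tokens would be projected into the LLM's embedding space and prepended to the prompt tokens, and the model would be fine-tuned (via transfer learning) to emit structured findings rather than returning a canned string.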
Shehroz S. Khan
American University of the Middle East, Kuwait
One-class Classification · Deep Learning · Aging · Rehabilitation · Multimodal Sensors
Petar Przulj
Faculty of Applied Science and Engineering, University of Toronto
Ahmed Ashraf
Department of Electrical and Computer Engineering, University of Manitoba
Ali Abedi
KITE Research Institute, Toronto Rehabilitation Institute, University Health Network