A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI systems for chest X-ray interpretation lack an explicit reasoning process linking visual evidence to diagnostic conclusions, which limits their interpretability and clinical trustworthiness. This work proposes CheXOne, presented as the first medical-imaging model to generate clinically credible, explicit chains of reasoning. CheXOne is a vision-language foundation model trained on 14.7 million multitask instruction-reasoning samples through a two-stage framework combining instruction fine-tuning and reinforcement learning, so that it jointly produces diagnostic predictions and interpretable reasoning trajectories. Experiments demonstrate that the model outperforms existing medical and general-purpose large models across 17 zero-shot evaluation settings. Clinical assessments show that 55% of its generated reports meet or exceed the quality of those produced by resident physicians, and that its reasoning trajectories exhibit high clinical factuality.
📝 Abstract
Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.
Problem

Research questions and friction points this paper is trying to address.

chest X-ray interpretation
explainable AI
clinical reasoning
vision-language model
diagnostic transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning-enabled
vision-language model
chest X-ray interpretation
clinical interpretability
reinforcement learning
Yabin Zhang
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA
Chong Wang
Stanford University
Trustworthy AI, Deep Learning, Medical Image Analysis
Yunhe Gao
Stanford University, Rutgers University
Computer Vision, Machine Learning, Medical Imaging Analysis, Vision-Language Model
Jiaming Liu
Postdoc@Stanford, PhD@WUSTL
Optimization, Computational Imaging, Deep Learning
Maya Varma
Stanford University
Computer Science
Justin Xu
University of Oxford
machine learning, natural language processing, electronic health records
Sophie Ostmeier
Stanford University
ML, Medicine
Jin Long
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Pediatrics, Stanford University, Stanford, CA, USA
Sergios Gatidis
Stanford Medicine
Healthcare AI, Medical Image and Data Analysis, Pediatric Radiology, Hybrid Imaging
Seena Dehkharghani
New York University
Neuroradiology
Arne Michalson
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA
Eun Kyoung Hong
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA
Christian Bluethgen
Radiologist, Clinician Scientist, USZ Zurich, AIMI Center, Stanford University
Radiology, Thoracic Imaging, Multimodal Machine Learning
Haiwei Henry Guo
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA
Alexander Victor Ortiz
Department of Radiology, Stanford University, Stanford, CA, USA
Stephan Altmayer
Department of Radiology, Stanford University, Stanford, CA, USA
Sandhya Bodapati
Department of Radiology, Stanford University, Stanford, CA, USA
Joseph David Janizek
Department of Radiology, Stanford University, Stanford, CA, USA
Ken Chang
Stanford University
Machine Learning, Medical Imaging, Distributed Learning
Jean-Benoit Delbrouck
Hugging Face, Stanford
Akshay S. Chaudhari
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Curtis P. Langlotz
Professor of Radiology, Medicine, and Biomedical Data Science, Stanford University
machine learning, computer vision, natural language processing, decision support systems, technology assessment