A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI systems for chest X-ray interpretation lack an explicit reasoning process linking visual evidence to diagnostic conclusions, which limits their interpretability and clinical trustworthiness. This work proposes CheXOne, presented as the first medical-imaging model to generate clinically credible, explicit chains of reasoning. CheXOne is a vision-language foundation model trained on 14.7 million multitask instruction-reasoning samples through a two-stage framework combining instruction fine-tuning and reinforcement learning, so that it jointly produces diagnostic predictions and interpretable reasoning trajectories. Experiments demonstrate that the model outperforms existing medical and general-purpose large models across 17 zero-shot evaluation settings. Clinical assessments show that 55% of its generated reports meet or exceed the quality of those produced by resident physicians, and that its reasoning trajectories exhibit high clinical factuality.
📝 Abstract
Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.
Problem

Research questions and friction points this paper is trying to address.

chest X-ray interpretation
explainable AI
clinical reasoning
vision-language model
diagnostic transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning-enabled
vision-language model
chest X-ray interpretation
clinical interpretability
reinforcement learning
Yabin Zhang
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA
Chong Wang
Stanford University
Trustworthy AI, Deep Learning, Medical Image Analysis
Yunhe Gao
Stanford University, Rutgers University
Computer Vision, Machine Learning, Medical Imaging Analysis, Vision-Language Model
Jiaming Liu
Postdoc@Stanford, PhD@WUSTL
Optimization, Computational Imaging, Deep Learning
Maya Varma
Stanford University
Computer Science
Justin Xu
University of Oxford
machine learning, natural language processing, electronic health records
Sophie Ostmeier
Stanford University
ML, Medicine
Jin Long
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Pediatrics, Stanford University, Stanford, CA, USA
Sergios Gatidis
Stanford Medicine
Healthcare AI, Medical Image and Data Analysis, Pediatric Radiology, Hybrid Imaging
Seena Dehkharghani
New York University
Neuroradiology
Arne Michalson
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA
Eun Kyoung Hong
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA
Christian Bluethgen
Radiologist, Clinician Scientist, USZ Zurich, AIMI Center, Stanford University
Radiology, Thoracic Imaging, Multimodal Machine Learning
Haiwei Henry Guo
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA
Alexander Victor Ortiz
Department of Radiology, Stanford University, Stanford, CA, USA
Stephan Altmayer
Department of Radiology, Stanford University, Stanford, CA, USA
Sandhya Bodapati
Department of Radiology, Stanford University, Stanford, CA, USA
Joseph David Janizek
Department of Radiology, Stanford University, Stanford, CA, USA
Ken Chang
Stanford University
Machine Learning, Medical Imaging, Distributed Learning
Jean-Benoit Delbrouck
Hugging Face, Stanford
Akshay S. Chaudhari
Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University, Palo Alto, CA, USA; Department of Radiology, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
Curtis P. Langlotz
Professor of Radiology, Medicine, and Biomedical Data Science, Stanford University
machine learning, computer vision, natural language processing, decision support systems, technology assessment