Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current chest X-ray (CXR) diagnostic models exhibit inflated performance due to reliance on clinical context shortcuts—particularly discharge summaries—rather than genuine visual features, leading to significant performance degradation on cases with high prior probability. Method: We propose the first evaluation paradigm that constructs label-level prior probabilities from clinical text (e.g., discharge notes) via NLP-based context extraction, designs a balanced test set to mitigate shortcut bias, and conducts systematic ablation studies. Contribution/Results: Experiments demonstrate that mainstream models suffer substantial performance drops when contextual cues are removed; mean accuracy fails to reflect true visual diagnostic capability. This work shifts medical AI evaluation from “black-box average metrics” toward “causal robustness validation,” establishing a methodological foundation for trustworthy clinical deployment.
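The paper's exact NLP pipeline for extracting label-level priors from discharge notes is not detailed in this summary; as a minimal sketch of the idea, the toy function below maps affirmative and negated mentions of a label in a prior note to a crude "pre-test" probability. The keyword lists, negation cues, and probability values are illustrative assumptions, not the paper's method.

```python
import re

# Hypothetical keyword lists per CXR label (assumption, not from the paper).
LABEL_KEYWORDS = {
    "Pleural Effusion": ["pleural effusion", "effusion"],
    "Pneumonia": ["pneumonia", "consolidation"],
}
# Simple negation cues checked in a short window before each mention.
NEGATIONS = ["no ", "without ", "denies ", "negative for "]

def pretest_probability(note: str, label: str) -> float:
    """Map mentions of a label in a prior note to a crude prior probability."""
    text = note.lower()
    hits = 0
    for kw in LABEL_KEYWORDS[label]:
        for m in re.finditer(re.escape(kw), text):
            window = text[max(0, m.start() - 20):m.start()]
            if any(neg in window for neg in NEGATIONS):
                hits -= 1  # negated mention lowers the prior
            else:
                hits += 1  # affirmative mention raises it
    # Squash mention counts into [0, 1]; no mentions -> baseline prior of 0.1.
    if hits > 0:
        return min(0.9, 0.1 + 0.4 * hits)
    if hits < 0:
        return 0.05
    return 0.1

note = "History of pleural effusion treated with diuresis. No pneumonia."
print(pretest_probability(note, "Pleural Effusion"))  # → 0.9
print(pretest_probability(note, "Pneumonia"))         # → 0.05
```

A real implementation would use a clinical NLP tool (e.g., a negation-aware concept extractor) rather than substring matching, but the output shape is the same: one prior probability per label per study.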


📝 Abstract
Public healthcare datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing computer vision models in healthcare. However, strong average-case performance of machine learning (ML) models on these datasets is insufficient to certify their clinical utility. In this paper, we use clinical context, as captured by prior discharge summaries, to provide a more holistic evaluation of current "state-of-the-art" models for the task of CXR diagnosis. Using discharge summaries recorded prior to each CXR, we derive a "prior" or "pre-test" probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. Using this measure, we demonstrate two key findings: First, for several diagnostic labels, CXR models tend to perform best on cases where the pre-test probability is very low, and substantially worse on cases where the pre-test probability is higher. Second, we use pre-test probability to assess whether strong average-case performance reflects true diagnostic signal, rather than an ability to infer the pre-test probability as a shortcut. We find that performance drops sharply on a balanced test set where this shortcut does not exist, which may indicate that much of the apparent diagnostic power derives from inferring this clinical context. We argue that this style of analysis, using context derived from clinical notes, is a promising direction for more rigorous and fine-grained evaluation of clinical vision models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating chest X-ray models using clinical context from discharge summaries
Assessing whether strong performance reflects true diagnostic signal or shortcuts
Providing more rigorous evaluation of clinical vision models using pre-test probabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using prior discharge summaries for pre-test probability
Evaluating model performance across pre-test probability levels
Balanced test set to remove clinical context shortcuts
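The two evaluation ideas above can be sketched as code: (1) stratify accuracy by pre-test-probability bins, and (2) build a balanced subset in which the label is decorrelated from the prior, so clinical context cannot act as a shortcut. Function names, bin edges, and the low/high-prior split are illustrative assumptions, not the paper's exact protocol.

```python
import random
from collections import defaultdict

def stratified_accuracy(preds, labels, priors, bins=(0.0, 0.33, 0.66, 1.01)):
    """Accuracy within each pre-test-probability bin."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, prior in zip(preds, labels, priors):
        b = sum(prior >= edge for edge in bins) - 1  # bin index for this prior
        total[b] += 1
        correct[b] += int(p == y)
    return {b: correct[b] / total[b] for b in total}

def balanced_subset(labels, priors, seed=0):
    """Sample indices so positives and negatives are matched within each
    prior bin, removing the label-prior correlation a model could exploit."""
    rng = random.Random(seed)
    groups = defaultdict(lambda: defaultdict(list))
    for i, (y, prior) in enumerate(zip(labels, priors)):
        b = 0 if prior < 0.5 else 1  # coarse low/high-prior split (assumption)
        groups[b][y].append(i)
    keep = []
    for by_label in groups.values():
        n = min(len(v) for v in by_label.values())  # match class counts
        for idxs in by_label.values():
            keep.extend(rng.sample(idxs, n))
    return sorted(keep)

# Example: accuracy per prior bin for a toy set of predictions.
preds, labels = [1, 1, 0, 1], [1, 0, 0, 1]
priors = [0.1, 0.5, 0.5, 0.9]
print(stratified_accuracy(preds, labels, priors))  # → {0: 1.0, 1: 0.5, 2: 1.0}
```

A model relying on the shortcut would show high accuracy only in bins where the prior already points to the answer; evaluating on `balanced_subset` indices removes that signal.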