🤖 AI Summary
Current chest X-ray (CXR) diagnostic models show inflated performance because they rely on clinical-context shortcuts, particularly discharge summaries, rather than genuine visual features, and their performance degrades substantially on cases with high prior probability. Method: the authors propose an evaluation paradigm that constructs label-level prior ("pre-test") probabilities from clinical text (e.g., discharge notes) via NLP-based context extraction, designs a balanced test set that removes the shortcut, and runs systematic ablation studies. Contribution/Results: experiments show that mainstream models suffer large performance drops when contextual cues are removed, so mean accuracy fails to reflect true visual diagnostic capability. This work shifts medical-AI evaluation from “black-box average metrics” toward “causal robustness validation,” establishing a methodological foundation for trustworthy clinical deployment.
📝 Abstract
Public healthcare datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing computer vision models in healthcare. However, strong average-case performance of machine learning (ML) models on these datasets is insufficient to certify their clinical utility. In this paper, we use clinical context, as captured by prior discharge summaries, to provide a more holistic evaluation of current "state-of-the-art" models for the task of CXR diagnosis. Using discharge summaries recorded prior to each CXR, we derive a "prior" or "pre-test" probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. Using this measure, we demonstrate two key findings: First, for several diagnostic labels, CXR models tend to perform best on cases where the pre-test probability is very low, and substantially worse on cases where the pre-test probability is higher. Second, we use pre-test probability to assess whether strong average-case performance reflects true diagnostic signal, rather than an ability to infer the pre-test probability as a shortcut. We find that performance drops sharply on a balanced test set where this shortcut does not exist, which may indicate that much of the apparent diagnostic power derives from inferring this clinical context. We argue that this style of analysis, using context derived from clinical notes, is a promising direction for more rigorous and fine-grained evaluation of clinical vision models.
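To make the balanced-test-set idea concrete, here is a minimal sketch of one way such a set could be constructed: subsample cases so that, within each pre-test-probability bin, positives and negatives appear in equal numbers, leaving the pre-test probability uninformative about the label. The field names (`pretest_prob`, `label`) and the binning scheme are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import defaultdict

def balanced_test_set(cases, n_bins=10, seed=0):
    """Subsample `cases` so pre-test probability carries no label signal.

    Each case is a dict with 'pretest_prob' (float in [0, 1]) and
    'label' (0 or 1) -- hypothetical field names for illustration.
    Within each probability bin, equal numbers of positive and negative
    cases are kept, so a model cannot score well merely by inferring
    the pre-test probability from context.
    """
    rng = random.Random(seed)
    bins = defaultdict(lambda: {0: [], 1: []})
    for c in cases:
        # Assign the case to a probability bin (clamp p == 1.0 into the top bin).
        b = min(int(c["pretest_prob"] * n_bins), n_bins - 1)
        bins[b][c["label"]].append(c)
    balanced = []
    for groups in bins.values():
        # Keep as many cases of each label as the rarer label allows.
        k = min(len(groups[0]), len(groups[1]))
        balanced += rng.sample(groups[0], k) + rng.sample(groups[1], k)
    return balanced
```

A model whose accuracy holds up on this subset is drawing on visual features; a model whose accuracy collapses was likely exploiting the contextual shortcut.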