XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language models (VLMs) lack systematic evaluation of cross-modal interpretability, i.e., the alignment between textual concepts and the corresponding visual evidence, in medical imaging. Method: We introduce the first vision-language explanation benchmark for chest X-rays, quantitatively assessing localization reliability across seven CLIP-style VLMs. Our approach generates explanations via cross-attention and similarity mapping, validated against radiologist-annotated ground truth. Results: State-of-the-art models accurately localize large, well-defined pathologies but perform markedly worse on small or diffuse lesions; moreover, classification accuracy strongly correlates with localization alignment, and domain-specific pretraining significantly improves alignment quality. This work establishes the first rigorous benchmark for evaluating interpretability in medical VLMs, providing an empirical foundation for assessing clinical trustworthiness prior to deployment.

📝 Abstract
Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance degrades substantially for small or diffuse lesions; (2) models pretrained on chest X-ray-specific datasets exhibit better alignment than those trained on general-domain data; and (3) a model's overall recognition ability and grounding ability are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short of clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
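
As a concrete illustration of the similarity-based localization described in the abstract, the snippet below is a minimal sketch, not the authors' released code: it assumes a CLIP-style model exposing per-patch image embeddings on a square patch grid and a single text embedding for a pathology prompt (names `patch_feats`, `text_feat`, the 14x14 grid, and the 224x224 output size are all illustrative assumptions).

```python
# Minimal sketch of a similarity-based localization map for a CLIP-style VLM.
# Assumptions (not from the paper): patch_feats has shape (num_patches, dim)
# from a 14x14 patch grid; text_feat has shape (dim,) for one pathology prompt.
import torch
import torch.nn.functional as F

def similarity_map(patch_feats: torch.Tensor,
                   text_feat: torch.Tensor,
                   grid: int = 14,
                   out_size: int = 224) -> torch.Tensor:
    """Cosine similarity between each image patch and the text concept,
    reshaped to the patch grid and upsampled to a normalized heatmap."""
    patch_feats = F.normalize(patch_feats, dim=-1)   # (P, D)
    text_feat = F.normalize(text_feat, dim=-1)       # (D,)
    sims = patch_feats @ text_feat                   # (P,) patch-text similarities
    heat = sims.reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=(out_size, out_size),
                         mode="bilinear", align_corners=False).squeeze()
    # Min-max normalize to [0, 1] so maps are comparable across images.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat  # (out_size, out_size) explanation map
```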
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-language model grounding in chest X-rays
Assessing alignment between textual concepts and visual evidence
Benchmarking interpretability for reliable clinical deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed cross-attention and similarity-based localization maps
Evaluated seven CLIP-style VLM variants systematically
Assessed alignment with radiologist-annotated pathology regions (see the evaluation sketch after this list)
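
The summary does not specify the paper's exact alignment metrics, so the following is a hedged sketch of two common localization scores, intersection-over-union at a threshold and the "pointing game", used here purely to illustrate how a heatmap can be scored against a radiologist-annotated mask.

```python
# Illustrative scoring of an explanation heatmap against a ground-truth mask.
# IoU and the pointing game are standard localization measures; they are
# assumptions for illustration, not necessarily the metrics used in the paper.
import numpy as np

def iou_at_threshold(heatmap: np.ndarray, gt_mask: np.ndarray, thr: float = 0.5) -> float:
    """Binarize the heatmap at `thr` and compute IoU with the annotated region."""
    pred = heatmap >= thr
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 0.0

def pointing_game_hit(heatmap: np.ndarray, gt_mask: np.ndarray) -> bool:
    """Return True if the heatmap's maximum falls inside the annotated region."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(gt_mask[y, x])
```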