Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models for radiology report generation rely heavily on superficial text-similarity metrics, which obscure critical issues such as missing clinical terminology and templated outputs, often yielding high-scoring reports with limited clinical utility. To address this, the work proposes two novel evaluation metrics, Clinical Association Displacement (CAD) and Weighted Association Erasure (WAE), which systematically assess the impact of different decoding strategies on clinical specificity and demographic fairness through lexical diversity analysis, modeling of word-association shifts, and weighted semantic-erasure computation. The study reveals that deterministic decoding induces severe semantic erasure, while stochastic sampling, although enhancing diversity, introduces new biases. These findings expose significant blind spots in current evaluation frameworks and call for a redefinition of what constitutes an "optimal" radiology report.
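The lexical diversity analysis the summary refers to can be approximated with a standard distinct-n statistic. The sketch below is illustrative only; the paper's exact diversity measure is not specified here:

```python
def distinct_n(reports, n=2):
    """Distinct-n: ratio of unique n-grams to total n-grams across a
    set of generated reports. Values near 0 flag templated, repetitive
    output; values near 1 indicate diverse generations.
    (Illustrative proxy, not the paper's exact measure.)"""
    grams = []
    for report in reports:
        tokens = report.split()
        grams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

# A model collapsed onto one template scores low:
templated = ["no acute findings", "no acute findings", "no acute findings"]
varied = ["mild cardiomegaly noted", "small pleural effusion present"]
print(distinct_n(templated))  # low: every report shares the same bigrams
print(distinct_n(varied))     # high: all bigrams are unique
```

A low distinct-n on generations that still score well on token-overlap metrics is exactly the "template collapse" pattern the paper warns about.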

📝 Abstract
Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: decoding strategies that achieve high aggregate token-overlap scores even as models succumb to template collapse, generating only repetitive, safe, generic text and omitting clinical terminology. Left unaddressed, this blind spot invites metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
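A toy reading of the two metrics: CAD measures, per demographic group, how a clinical term's association shifts between reference and generated reports; WAE aggregates the losses. The sketch below is an assumed interpretation built from the abstract's definitions; the association estimator, the group weights, and the aggregation rule are all illustrative, not the paper's actual formulas:

```python
def association(reports, term):
    """Fraction of a group's reports mentioning a clinical term
    (a simple co-occurrence proxy; assumed, not the paper's estimator)."""
    return sum(term in report.split() for report in reports) / len(reports)

def cad(ref_by_group, gen_by_group, term):
    """Clinical Association Displacement (illustrative): per-group shift
    in term association from reference to generated reports."""
    return {g: association(gen_by_group[g], term) - association(ref_by_group[g], term)
            for g in ref_by_group}

def wae(ref_by_group, gen_by_group, terms, weights):
    """Weighted Association Erasure (illustrative): weighted sum of
    association *losses* (negative displacements only) over terms and groups."""
    total = 0.0
    for term in terms:
        for group, shift in cad(ref_by_group, gen_by_group, term).items():
            total += weights[group] * max(0.0, -shift)  # count only erasure
    return total

# Toy data: group "A" loses "effusion" and "cardiomegaly" mentions entirely.
ref = {"A": ["mild cardiomegaly noted", "pleural effusion present"],
       "B": ["no acute findings", "pleural effusion present"]}
gen = {"A": ["no acute findings", "no acute findings"],
       "B": ["no acute findings", "pleural effusion present"]}
print(wae(ref, gen, ["effusion", "cardiomegaly"], {"A": 0.5, "B": 0.5}))
```

In this toy run, group A's generated reports drop both terms while group B's are preserved, so all of the erasure score comes from group A, which is the kind of demographically uneven signal loss the metrics are designed to surface.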
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Radiology Report Generation
Clinical Terminology Erasure
Validation Metrics
Demographic Fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinical Association Displacement
Weighted Association Erasure
template collapse
lexical diversity
demographic fairness