🤖 AI Summary
Automatically generated radiology reports often receive high scores from existing evaluation metrics yet fail to earn clinical trust, because the metrics have fundamental deficiencies in clinical semantic understanding: they cannot distinguish clinically significant errors, they over-penalize harmless lexical or syntactic variations, and they respond non-monotonically to error severity. Method: We propose the first clinically oriented meta-evaluation framework, defining two dimensions, clinical alignment and core capability, and constructing a manually annotated dataset with fine-grained error types, clinical importance ratings, and expert explanations. Contribution/Results: Through systematic benchmarking of mainstream metrics, we empirically expose their semantic blind spots. The framework not only diagnoses the failure mechanisms of current metrics but also establishes a methodological foundation and a practical paradigm for trustworthy, interpretable evaluation of medical text generation.
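
As a minimal sketch, an annotated report pair of the kind described above could be represented roughly as follows; the class and field names are hypothetical and illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedReportPair:
    """One ground-truth/rewritten report pair with expert annotations
    (hypothetical schema; field names are illustrative only)."""
    ground_truth: str             # reference radiology report
    rewritten: str                # rewritten report containing an injected change
    error_type: str               # fine-grained error category
    clinically_significant: bool  # expert rating of clinical importance
    explanation: str              # expert rationale for the rating

example = AnnotatedReportPair(
    ground_truth="No focal consolidation. Heart size is normal.",
    rewritten="Right lower lobe consolidation. Heart size is normal.",
    error_type="false positive finding",
    clinically_significant=True,
    explanation="Introduces a consolidation absent from the reference report.",
)
```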
📝 Abstract
Automatically generated radiology reports often receive high scores from existing evaluation metrics but fail to earn clinicians' trust. This gap reveals fundamental flaws in how current metrics assess the quality of generated reports. We rethink the design and evaluation of these metrics and propose a clinically grounded meta-evaluation framework, with criteria spanning clinical alignment and key metric capabilities, including discrimination, robustness, and monotonicity. Using a fine-grained dataset of ground-truth and rewritten report pairs annotated with error types, clinical significance labels, and explanations, we systematically evaluate existing metrics and reveal their limitations in interpreting clinical semantics, such as failing to distinguish clinically significant errors, over-penalizing harmless variations, and lacking consistency across error severity levels. Our framework offers guidance for building more clinically reliable evaluation methods.
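
As a rough illustration of what capability criteria such as discrimination and monotonicity could mean operationally, the sketch below assumes only a generic `metric(reference, candidate) -> float` scoring function; it is not the paper's implementation.

```python
from typing import Callable, Sequence

Metric = Callable[[str, str], float]  # metric(reference, candidate) -> score

def discrimination_gap(metric: Metric, reference: str,
                       significant_rewrite: str, harmless_rewrite: str) -> float:
    """A discriminating metric should penalize a clinically significant error
    more than a harmless paraphrase, yielding a positive gap."""
    return metric(reference, harmless_rewrite) - metric(reference, significant_rewrite)

def is_monotonic(metric: Metric, reference: str,
                 rewrites_by_severity: Sequence[str]) -> bool:
    """Scores should not increase as error severity increases
    (rewrites are ordered from least to most severe)."""
    scores = [metric(reference, r) for r in rewrites_by_severity]
    return all(a >= b for a, b in zip(scores, scores[1:]))
```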