🤖 AI Summary
Existing evaluation metrics for medical report generation focus mainly on keyword coverage, neglecting clinically critical fine-grained aspects such as lesion location, severity, and diagnostic uncertainty, which leaves reliability assessment incomplete. To address this, we propose the first multi-agent collaborative framework for trustworthy radiology report evaluation. The method performs fine-grained parsing at the anatomical and lesion level via named entity recognition (scored with an entity-level NER-F1) and introduces an LLM-driven subjective scoring module, enabling structured feedback and clinical interpretability. Evaluated on ReXVal and RadEvalX, the framework achieves Kendall correlation coefficients of 0.70 and 0.54 with expert ratings, substantially outperforming prevailing metrics. The code and an interactive demo system are publicly available.
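To illustrate the entity-level NER-F1 idea, here is a minimal sketch. The `(finding, location, severity)` tuples are hypothetical stand-ins for what the framework's parsing agents might extract; the actual NER model and attribute schema are not specified here, so this shows only the set-overlap F1 computation itself.

```python
def entity_f1(reference_entities, generated_entities):
    """Set-overlap F1 over extracted (finding, location, severity) tuples."""
    ref, gen = set(reference_entities), set(generated_entities)
    tp = len(ref & gen)  # entities matched exactly in both reports
    if tp == 0:
        return 0.0
    precision = tp / len(gen)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical parsed entities from a reference and a generated report.
reference = {("effusion", "left", "small"), ("cardiomegaly", "-", "mild")}
generated = {("effusion", "left", "small"), ("pneumothorax", "right", "-")}
print(round(entity_f1(reference, generated), 3))  # → 0.5
```

Because the tuples encode location and severity alongside the finding, a report that names the right disease in the wrong location scores lower than one that matches all three attributes.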
📝 Abstract
Automatic medical report generation supports clinical diagnosis, reduces radiologists' workload, and holds promise for improving diagnostic consistency. However, existing evaluation metrics primarily assess how accurately generated reports cover key medical information relative to human-written references, while overlooking crucial details such as the location and certainty of reported abnormalities. These limitations hinder comprehensive assessment of the reliability of generated reports and pose risks when selecting them for clinical use. We therefore propose the Granular Explainable Multi-Agent Score (GEMA-Score), which combines objective quantification with subjective evaluation through a large-language-model-based multi-agent workflow. GEMA-Score parses structured reports and computes NER-F1 through interactive information exchange among agents to assess disease diagnosis, location, severity, and uncertainty. In addition, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments show that GEMA-Score achieves the highest correlation with human expert evaluations on public datasets, demonstrating its effectiveness for clinical scoring (Kendall coefficient = 0.70 on ReXVal and 0.54 on RadEvalX). The project demo is available at: https://github.com/Zhenxuan-Zhang/GEMA_score.
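The reported agreement with experts is Kendall's rank correlation. A minimal pure-Python sketch of the tau-a statistic (concordant minus discordant pairs over all pairs) is shown below; the scores and ratings are made-up illustrative values, not results from ReXVal or RadEvalX.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie handling."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)  # >0 if the pair is ranked the same way by both
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical metric scores and 1-5 expert ratings for five generated reports.
metric_scores  = [0.82, 0.45, 0.67, 0.91, 0.30]
expert_ratings = [4, 3, 2, 5, 1]
print(round(kendall_tau(metric_scores, expert_ratings), 2))  # → 0.8
```

A tau of 1.0 would mean the metric ranks every pair of reports exactly as the experts do, so the reported 0.70 on ReXVal indicates strong but imperfect agreement.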