🤖 AI Summary
In radiology report generation (RRG), lexical metrics such as BLEU are vulnerable to template-driven "score inflation," leading to unreliable evaluation and poor clinical interpretability. Method: We propose a semantics-driven paradigm: (1) constructing the first lay-language terminology dataset for everyday clinical communication; (2) designing a semantic evaluation framework grounded in terminology alignment and BERTScore to expose systematic BLEU biases in medical text; (3) introducing template-decoupled training to reduce model reliance on rigid report templates. Contribution/Results: We identify a positive scaling law between lay-term dataset size and semantic gain. Experiments show >40% reduction in BLEU overestimation, consistent improvements across semantic quality metrics (e.g., BERTScore, clinician ratings), and significantly enhanced clinical readability, demonstrating that semantic fidelity, not lexical surface similarity, is critical for trustworthy RRG evaluation and deployment.
📝 Abstract
Radiology Report Generation (RRG) has made significant progress with the advancement of multimodal generative models. However, evaluation in this domain suffers from a lack of fair and robust metrics. We reveal that high performance on RRG under existing lexical metrics (e.g., BLEU) may be a mirage: a model can achieve a high BLEU score merely by learning the template of the reports. This has become an urgent problem for RRG because of the highly templated nature of these reports. In this work, we take a counterintuitive approach to this problem by proposing the Layman's RRG framework, a layman's-terms-based dataset, evaluation, and training framework that systematically improves RRG with everyday language. We first contribute the translated layman's terms dataset. Building on the dataset, we then propose a semantics-based evaluation method, which we show mitigates the inflated BLEU numbers and provides fairer evaluation. Finally, we show that training on the layman's terms dataset encourages models to focus on the semantics of the reports, rather than overfitting to report templates. We reveal a promising scaling law between the number of training examples and the semantic gain provided by our dataset, in contrast to the inverse pattern observed with the original report format. Our code is available at https://github.com/hegehongcha/LaymanRRG.
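The BLEU-inflation argument can be illustrated with a toy experiment. The sketch below is not the paper's code: it uses a minimal stdlib-only sentence BLEU (geometric mean of add-one-smoothed n-gram precisions with a brevity penalty) and two hypothetical report sentences, showing that a templated hypothesis with a clinically different finding scores far higher than a lay-language paraphrase that preserves the meaning.

```python
# Minimal sketch (not the paper's implementation): lexical BLEU rewards
# template overlap over semantic correctness. All report text is a made-up
# illustrative example, not taken from any dataset.
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """Sentence BLEU with add-one smoothing on each n-gram precision."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)

reference = "heart size is normal . no focal consolidation . no pleural effusion ."
# Same rigid template, but a clinically different finding swapped in:
templated = "heart size is normal . no focal consolidation . no pneumothorax ."
# Roughly the same meaning rephrased in lay terms, little lexical overlap:
layman = "the heart looks a normal size and there is no fluid around the lungs ."

print(f"templated BLEU: {bleu(reference, templated):.3f}")  # high score
print(f"layman BLEU:    {bleu(reference, layman):.3f}")     # low score
```

The templated hypothesis scores high despite changing the finding, while the semantically faithful lay rewrite scores low, which is precisely the failure mode that motivates the semantics-based evaluation proposed in the paper.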