🤖 AI Summary
Inconsistent radiology report formatting hinders clinical interpretability and impedes machine learning adoption; existing large language models (LLMs) face deployment barriers due to high computational cost, low transparency, and privacy risks. This work proposes lightweight encoder-decoder models (<300M parameters) based on the T5 and BERT2BERT architectures and benchmarks them against eight open-source LLMs (1B–70B) adapted via prefix prompting, in-context learning (ICL), and LoRA fine-tuning. The models are rigorously evaluated on MIMIC-CXR and CheXpert Plus for structured report generation. Results demonstrate, for the first time, that the best lightweight model outperforms open-source LLMs adapted via prompt engineering or ICL across multiple metrics, including BLEU, ROUGE-L, BERTScore, F1-RadGraph, GREEN, and F1-SRR-BERT. Notably, LoRA fine-tuning LLaMA-3-70B yields only a marginal +4.3% F1-SRR-BERT improvement on the Findings section, yet incurs more than 400× higher inference latency, cost, and carbon emissions. These findings support a resource-efficient, privacy-preserving, and environmentally sustainable approach to clinical NLP.
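As a rough illustration of why LoRA keeps adaptation cheap, the sketch below implements the core low-rank-update idea in plain NumPy. This is a minimal toy, not the paper's implementation; the matrix shape, rank, and scaling constant are hypothetical choices:

```python
import numpy as np

d, k, r = 1024, 1024, 8  # hypothetical weight shape and LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight (not updated)
A = rng.standard_normal((r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-initialized, so the model starts unchanged

alpha = 16                               # LoRA scaling hyperparameter
W_adapted = W + (alpha / r) * (B @ A)    # effective weight after adaptation

full_params = d * k                      # parameters in a full fine-tune of W
lora_params = r * (d + k)                # parameters LoRA actually trains
print(f"trainable fraction: {lora_params / full_params:.3%}")  # ~1.6% of the full matrix
```

Only `A` and `B` receive gradients during fine-tuning, which is why adapting even large models touches a small fraction of their parameters, though, as the summary notes, inference cost of a 70B model is unaffected by this savings.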
📝 Abstract
Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (<300M parameters), specifically T5 and BERT2BERT, for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B–70B) adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) fine-tuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-fine-tuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU +6.4%, ROUGE-L +4.8%, BERTScore +3.6%, F1-RadGraph +1.1%, GREEN +3.6%, and F1-SRR-BERT +4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions of the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.
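Of the lexical metrics listed above, ROUGE-L scores the longest common subsequence (LCS) of tokens shared by a candidate and a reference text. A minimal pure-Python sketch of the F1 variant (not the paper's evaluation code; real evaluations typically use a tokenizer and an established package):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: F-measure over the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    # dynamic-programming table for LCS length
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# Two plausible report impressions sharing 3 of 4 tokens in order:
print(rouge_l_f1("no acute cardiopulmonary process",
                 "no acute cardiopulmonary abnormality"))  # → 0.75
```

Because LCS respects token order, ROUGE-L rewards fluent reorderings less than n-gram overlap metrics like BLEU do, which is one reason the paper reports both alongside semantic metrics such as BERTScore and RadGraph.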