Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods

📅 2025-11-06
🤖 AI Summary
This study addresses two weaknesses of automated radiology report de-identification: incomplete detection of protected health information (PHI) and poor cross-institutional generalization. The work extends a state-of-the-art transformer-based de-identification pipeline in three ways: (1) fine-tuning on two large annotated radiology corpora from Stanford University, covering chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports; (2) adding an AGE category to the PHI label set; and (3) evaluating token-level PHI detection, including on reports whose PHI was replaced with synthetic surrogates by a “hide-in-plain-sight” method. The model achieves overall F1 scores of 0.973 on the University of Pennsylvania (Penn) test set and 0.996 on the Stanford test set, and synthetic PHI remains consistently detectable (overall F1 = 0.959) across 50 independently de-identified Penn datasets. On synthetic Penn reports, the model outperforms all commercial cloud vendor systems tested (overall F1 = 0.960 vs. 0.632–0.754). The authors present the result as a new benchmark for clinical text de-identification.
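The “hide-in-plain-sight” idea mentioned above can be illustrated with a minimal sketch: each detected PHI span is replaced with a realistic surrogate of the same category, so any PHI the detector misses blends in with the surrogates. This is not the authors' implementation. The function name `hide_in_plain_sight`, the `SURROGATES` pools, the character-offset span format, and the bracketed fallback tag are all assumptions made for illustration.

```python
import random

# Hypothetical surrogate pools; a real system would draw from much larger,
# more realistic lists and keep dates and ages clinically plausible.
SURROGATES = {
    "PATIENT": ["Alex Morgan", "Jamie Lee"],
    "DATE": ["03/14/2019", "11/02/2020"],
    "AGE": ["47", "63"],
}


def hide_in_plain_sight(text, spans, rng):
    """Replace each detected PHI span with a surrogate of the same category.

    text:  the original report.
    spans: non-overlapping (start, end, label) character offsets (assumed format).
    rng:   a random.Random instance, passed in so runs are reproducible.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # keep non-PHI text untouched
        pool = SURROGATES.get(label)
        # Fall back to a bracketed tag when no surrogate pool exists (assumption).
        pieces.append(rng.choice(pool) if pool else f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


report = "Patient John Smith, 58 years old, scanned on 01/02/2021."
spans = [(8, 18, "PATIENT"), (20, 22, "AGE"), (45, 55, "DATE")]
print(hide_in_plain_sight(report, spans, random.Random(0)))
```

Running the replacement repeatedly with different seeds gives independently de-identified copies of the same report, which is one way the stability of detection on synthetic PHI could be probed.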

📝 Abstract
Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection.
Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories.
Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754).
Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy.
Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
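Token-level precision, recall, and F1, as reported in the abstract, can be sketched as follows. The abstract does not state the exact matching rule, so this sketch assumes that a predicted PHI token counts as correct only when its category matches the gold label, and that non-PHI tokens carry the label `O`. The function name `token_level_scores` and the toy label sequences are hypothetical.

```python
from collections import Counter


def token_level_scores(gold, pred, outside="O"):
    """Return (overall, per_label) precision/recall/F1 over aligned token labels.

    Assumption: a prediction is a true positive only if it is a PHI label
    and equals the gold label for that token.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp[p] += 1
        else:
            if p != outside:
                fp[p] += 1  # predicted PHI that is wrong or spurious
            if g != outside:
                fn[g] += 1  # gold PHI that was missed or mislabeled

    def prf(t, f_pos, f_neg):
        precision = t / (t + f_pos) if t + f_pos else 0.0
        recall = t / (t + f_neg) if t + f_neg else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    labels = sorted(set(tp) | set(fp) | set(fn))
    per_label = {label: prf(tp[label], fp[label], fn[label]) for label in labels}
    overall = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return overall, per_label


# Toy example: one missed PATIENT token and one spurious DATE token.
gold = ["O", "PATIENT", "PATIENT", "O", "AGE", "O", "DATE"]
pred = ["O", "PATIENT", "O", "O", "AGE", "DATE", "DATE"]
overall, per_label = token_level_scores(gold, pred)
print(overall)  # precision, recall, and F1 are each 0.75 here
```

In the toy example there are three true positives, one false positive, and one false negative, so overall precision and recall are both 3/4. A looser rule that counts any PHI tag on a PHI token as correct, regardless of category, would give different numbers.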
Problem

Research questions and friction points this paper is trying to address.

Enhancing automated de-identification of radiology reports for privacy protection
Benchmarking transformer-based models against commercial cloud vendor PHI detection systems
Improving cross-institutional generalization through large-scale multimodal training datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned transformer models on large radiology corpora
Introduced AGE category to enhance PHI detection
Outperformed commercial systems using synthetic data evaluation
Authors

Eva Prakash (Stanford University)
Maayane Attias (JP Morgan Chase & Co)
Pierre J. Chambon (Sorbonne University)
Justin Xu (University of Oxford)
S. Truong (NVIDIA)
Jean-Benoit Delbrouck (Hugging Face, Stanford)
Tessa Cook (University of Pennsylvania)
Curtis P. Langlotz (Professor of Radiology, Medicine, and Biomedical Data Science, Stanford University)
machine learning, computer vision, natural language processing, decision support systems, technology assessment