🤖 AI Summary
This study addresses low precision in protected health information (PHI) detection and poor cross-institutional generalization in radiology report de-identification. We propose a Transformer-based large-scale training framework comprising: (1) domain-adaptive fine-tuning on the Stanford Large Radiology Corpus; (2) a novel “hide-in-plain-sight” synthetic PHI generation technique that produces high-fidelity synthetic labeled data; and (3) fine-grained, token-level evaluation of PHI recognition performance. Our contributions include the first demonstration of an academic model outperforming leading commercial cloud services (F1 = 0.632–0.754) across multi-center test sets, achieving F1 scores of 0.973 (University of Pennsylvania), 0.996 (Stanford), and 0.959 (synthetic data). This advance significantly improves robustness and privacy protection, establishing a new state of the art for clinical text de-identification.
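The “hide-in-plain-sight” idea referenced above replaces detected PHI spans with realistic surrogate values of the same category, so that any PHI the detector missed blends in among synthetic values. A minimal sketch, assuming hypothetical surrogate pools and a span format of `(start, end, category)` (the paper's actual surrogate lexicons and span representation are not specified here):

```python
import random

# Hypothetical surrogate pools; a real system would draw from large
# name/date/age lexicons rather than these illustrative lists.
SURROGATES = {
    "NAME": ["John Carter", "Maria Lopez", "Wei Zhang"],
    "DATE": ["03/14/2019", "July 2, 2021"],
    "AGE":  ["47", "62", "85"],
}

def hide_in_plain_sight(text, spans):
    """Replace each detected PHI span (start, end, category) with a
    random surrogate of the same category. Spans are processed right
    to left so earlier character offsets remain valid."""
    for start, end, category in sorted(spans, reverse=True):
        surrogate = random.choice(SURROGATES[category])
        text = text[:start] + surrogate + text[end:]
    return text
```

Because surrogates are drawn per category, the de-identified report stays grammatically and clinically plausible while the original identifiers are removed.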
📝 Abstract
Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models with extensive training datasets and by benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection.

Materials and Methods: In this retrospective study, we built on a state-of-the-art, transformer-based PHI de-identification pipeline by fine-tuning it on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports, and by introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories.

Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, matching or exceeding previous state-of-the-art performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958–0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632–0.754).

Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness, and synthetic PHI generation preserved data utility while ensuring privacy.

Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
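The token-level precision, recall, and F1 scores reported above can be sketched as micro-averaged set comparisons between gold and predicted PHI token labels. This is a minimal illustration, assuming tokens are represented as `(token_index, category)` pairs (the paper's exact tokenization and averaging scheme are not specified here):

```python
def token_prf(gold, pred):
    """Micro-averaged token-level precision, recall, and F1.

    gold, pred: sets of (token_index, category) pairs marking which
    tokens were labeled as PHI and with which category. A prediction
    counts as a true positive only if both index and category match.
    """
    tp = len(gold & pred)   # correctly labeled PHI tokens
    fp = len(pred - gold)   # predicted PHI that is not in the gold standard
    fn = len(gold - pred)   # gold PHI tokens the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Requiring the category to match (not just the token position) makes this a stricter, fine-grained measure of PHI recognition than span-agnostic scoring.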