A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

📅 2026-04-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the challenge of balancing privacy preservation and model utility in cross-institutional sharing of radiology images and reports. The authors propose a novel de-identification pipeline that integrates a blacklist of privacy-sensitive terms, a whitelist of pathology-relevant terms, generative image filtering, and removal of identifiers from reports, synthesizing data that retains critical diagnostic information while eliminating personally identifiable elements. Systematic evaluation on a public chest X-ray dataset demonstrates, for the first time, that large vision–language models trained on this de-identified data achieve diagnostic performance comparable to models trained on the original data, with substantially reduced re-identification risk. Furthermore, in cross-hospital transfer scenarios, combining local institutional data with the de-identified data further improves model performance, effectively reconciling clinical utility with robust privacy safeguards.
๐Ÿ“ Abstract
Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.
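The report side of the pipeline (blacklist filtering plus identifier removal, with a whitelist used to verify that pathology terms survive) can be illustrated with a minimal sketch. The term lists, regex pattern, and function names below are illustrative assumptions; the paper's actual lists and filtering rules are not given here.

```python
import re

# Hypothetical term lists -- stand-ins for the paper's blacklist/whitelist.
PRIVACY_BLACKLIST = {"date of birth", "home address", "phone"}
PATHOLOGY_WHITELIST = {"cardiomegaly", "pleural effusion", "pneumothorax"}

# Illustrative pattern for report identifiers such as "MRN: 48213" or "Accession #A99".
ID_PATTERN = re.compile(r"\b(?:MRN|Accession)\s*[:#]?\s*\w+", re.IGNORECASE)

def deidentify_report(report: str) -> str:
    """Replace identifier patterns and blacklisted terms with a placeholder."""
    text = ID_PATTERN.sub("[REMOVED]", report)
    for term in PRIVACY_BLACKLIST:
        text = re.sub(re.escape(term), "[REMOVED]", text, flags=re.IGNORECASE)
    return text

def pathology_preserved(original: str, cleaned: str) -> bool:
    """Check that every whitelisted pathology term in the original survives filtering."""
    return all(
        term in cleaned.lower()
        for term in PATHOLOGY_WHITELIST
        if term in original.lower()
    )

report = "MRN: 48213. Findings: mild cardiomegaly, no pleural effusion. Accession #A99."
clean = deidentify_report(report)
```

The whitelist is deliberately used only as a post-hoc utility check rather than a filter, mirroring the paper's framing: privacy terms are removed, and pathology cues are verified to remain.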
Problem

Research questions and friction points this paper is trying to address.

de-identification
radiology data sharing
privacy preservation
data utility
cross-hospital transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

utility-preserving de-identification
generative filtering
cross-hospital data sharing
radiology data
privacy-preserving AI