A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

📅 2026-04-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the challenge of balancing privacy preservation and model utility in cross-institutional sharing of radiology images and reports. The authors propose a novel de-identification pipeline that integrates a blacklist of privacy-sensitive terms, a whitelist of pathology-relevant terms, generative image filtering, and removal of identifiers from reports, synthesizing data that retains critical diagnostic information while eliminating personally identifiable elements. Systematic evaluation on a public chest X-ray dataset demonstrates, for the first time, that large vision–language models trained on this de-identified data achieve diagnostic performance comparable to models trained on the original data, with substantially reduced re-identification risk. Furthermore, in cross-hospital transfer scenarios, combining local institutional data with the de-identified data further improves model performance, effectively reconciling clinical utility with robust privacy safeguards.
๐Ÿ“ Abstract
Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.
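The report side of the pipeline (blacklist filtering plus identifier removal, with a whitelist used to verify that pathology terms survive) can be illustrated with a minimal sketch. The term lists, regex pattern, and function names below are illustrative assumptions; the paper's actual lists and filtering rules are not given here.

```python
import re

# Hypothetical term lists -- stand-ins for the paper's blacklist/whitelist.
PRIVACY_BLACKLIST = {"date of birth", "home address", "phone"}
PATHOLOGY_WHITELIST = {"cardiomegaly", "pleural effusion", "pneumothorax"}

# Illustrative pattern for report identifiers such as "MRN: 48213" or "Accession #A99".
ID_PATTERN = re.compile(r"\b(?:MRN|Accession)\s*[:#]?\s*\w+", re.IGNORECASE)

def deidentify_report(report: str) -> str:
    """Replace identifier patterns and blacklisted terms with a placeholder."""
    text = ID_PATTERN.sub("[REMOVED]", report)
    for term in PRIVACY_BLACKLIST:
        text = re.sub(re.escape(term), "[REMOVED]", text, flags=re.IGNORECASE)
    return text

def pathology_preserved(original: str, cleaned: str) -> bool:
    """Check that every whitelisted pathology term in the original survives filtering."""
    return all(
        term in cleaned.lower()
        for term in PATHOLOGY_WHITELIST
        if term in original.lower()
    )

report = "MRN: 48213. Findings: mild cardiomegaly, no pleural effusion. Accession #A99."
clean = deidentify_report(report)
```

The whitelist is deliberately used only as a post-hoc utility check rather than a filter, mirroring the paper's framing: privacy terms are removed, and pathology cues are verified to remain.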
Problem

Research questions and friction points this paper is trying to address.

de-identification
radiology data sharing
privacy preservation
data utility
cross-hospital transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

utility-preserving de-identification
generative filtering
cross-hospital data sharing
radiology data
privacy-preserving AI