The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation

📅 2025-02-11

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work identifies severe training data memorization in text-to-image diffusion models that synthesize chest X-rays from the MIMIC-CXR dataset. Prompts that retain de-identification artifacts (markers left behind where protected information was redacted) are identified as the strongest memorization cues. Method: The study combines memorization attribution analysis, prompt-level sensitivity evaluation, and token-level memorization quantification, complemented by adversarial ablation experiments. Contribution/Results: Existing inference-time mitigation strategies show limited efficacy (<20% reduction) in suppressing memorization, and residual de-identification markers significantly exacerbate privacy leakage. The work establishes the first benchmark of memorized prompts for chest radiograph synthesis and introduces privacy-enhancing practices tailored to medical imaging. It provides a reproducible evaluation paradigm and actionable interventions to support compliant, trustworthy synthetic data deployment in healthcare.


📝 Abstract

Generative models, particularly text-to-image (T2I) diffusion models, play a crucial role in medical image analysis. However, these models are prone to training data memorization, posing significant risks to patient privacy. Synthetic chest X-ray generation is one of the most common applications in medical image analysis, with the MIMIC-CXR dataset serving as the primary data repository for this task. This study adopts a data-driven approach and presents the first systematic attempt to identify the prompts and text tokens in MIMIC-CXR that contribute the most to training data memorization. Our analysis reveals an unexpected finding: prompts containing traces of de-identification procedures are among the most memorized, with de-identification markers contributing the most. Furthermore, we find that existing inference-time memorization mitigation strategies are ineffective and fail to sufficiently reduce the model's reliance on memorized text tokens, highlighting a broader issue in T2I synthesis with MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy and improve the reliability of generative models in medical imaging. Finally, our results provide a foundation for future work on developing and benchmarking memorization mitigation techniques for synthetic chest X-ray generation using the MIMIC-CXR dataset.
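The prompt-level finding in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: it assumes that a per-prompt memorization score has already been computed by some detection method, and that de-identification markers appear as runs of three or more underscores. The marker pattern, the function names, and the toy scores below are all hypothetical and serve only to show how prompts with and without de-identification traces could be compared.

```python
import re
from statistics import mean

# Assumption: redacted spans show up as runs of three or more underscores.
# The real marker format should be checked against the dataset in use.
DEID_MARKER = re.compile(r"_{3,}")


def has_deid_trace(prompt: str) -> bool:
    """Return True if the prompt contains an assumed de-identification marker."""
    return DEID_MARKER.search(prompt) is not None


def rank_by_score(scored_prompts):
    """Sort (prompt, score) pairs from highest to lowest memorization score."""
    return sorted(scored_prompts, key=lambda pair: pair[1], reverse=True)


def mean_score_by_group(scored_prompts):
    """Average the scores of prompts with and without a marker (None if a group is empty)."""
    with_trace = [s for p, s in scored_prompts if has_deid_trace(p)]
    without_trace = [s for p, s in scored_prompts if not has_deid_trace(p)]
    return {
        "with_trace": mean(with_trace) if with_trace else None,
        "without_trace": mean(without_trace) if without_trace else None,
    }


# Toy data: prompts and scores are invented for illustration only.
EXAMPLE = [
    ("Comparison to ___. No pleural effusion.", 0.9),
    ("Heart size is normal.", 0.2),
    ("___ year old with cough.", 0.7),
    ("Lungs are clear.", 0.4),
]

if __name__ == "__main__":
    for prompt, score in rank_by_score(EXAMPLE):
        flag = "marker" if has_deid_trace(prompt) else "clean"
        print(f"{score:.2f}  [{flag}]  {prompt}")
    print(mean_score_by_group(EXAMPLE))
```

A gap between the two group means on real scores would be consistent with the abstract's claim, though it would not by itself show that the markers cause the memorization.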

Problem

Research questions and friction points this paper is trying to address.

Identify memorization-prone prompts in MIMIC-CXR

Study de-identification traces' impact on memorization

Propose strategies to mitigate memorization risks

Innovation

Methods, ideas, or system contributions that make the work stand out.

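Present the first systematic analysis of memorization in T2I chest X-ray synthesis with MIMIC-CXR

Identify the prompts and text tokens that contribute most to memorization, with de-identification markers ranking highest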

Propose privacy-enhancing strategies

🔎 Similar Papers

Unconditional Latent Diffusion Models Memorize Patient Imaging Data: Implications for Openly Sharing Synthetic Data