🤖 AI Summary
Watermarks applied in text-to-medical-image synthesis can degrade pathological detail and compromise diagnostic reliability. To address this, the paper proposes MedSign, a lesion-aware adaptive watermarking framework. MedSign builds a lesion localization map from cross-attention between medical text tokens and the diffusion denoising network, aggregating token-wise attention across layers, heads, and timesteps. It then fine-tunes the Latent Diffusion Model (LDM) decoder to embed the watermark during image synthesis, attenuating watermark intensity in lesion regions so that the embedding is spatially adaptive. Evaluated on MIMIC-CXR and OIA-ODIR, MedSign achieves state-of-the-art watermark detection accuracy (>99.2%), improves image fidelity (FID reduced by 18.7%), and preserves 98.4% diagnostic consistency over pathological regions. It is presented as the first framework to jointly optimize forensic robustness and clinical diagnostic integrity in medical diffusion models.
📝 Abstract
As recent text-conditioned diffusion models have enabled the generation of high-quality images, concerns over their potential misuse have also grown. This issue is critical in the medical domain, where text-conditioned generated medical images could enable insurance fraud or falsified records, highlighting the urgent need for reliable safeguards against unethical use. While watermarking techniques have emerged as a promising solution in general image domains, their direct application to medical imaging presents significant challenges. A key challenge is preserving fine-grained disease manifestations: even minor distortions from a watermark may lead to clinical misinterpretation and thereby compromise diagnostic integrity. To address this gap, we present MedSign, a deep learning-based watermarking framework designed specifically for text-to-medical-image synthesis, which preserves pathologically significant regions by adaptively adjusting watermark strength. Specifically, we generate a pathology localization map using cross-attention between medical text tokens and the diffusion denoising network, aggregating token-wise attention across layers, heads, and time steps. Leveraging this map, we optimize the LDM decoder to incorporate watermarking during image synthesis, ensuring cohesive integration while minimizing interference in diagnostically critical regions. Experimental results show that MedSign preserves diagnostic integrity while ensuring watermark robustness, achieving state-of-the-art performance in image quality and detection accuracy on the MIMIC-CXR and OIA-ODIR datasets.
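The two ideas in the abstract, aggregating attention into a localization map and lowering watermark strength where that map is high, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the cross-attention maps for the pathology-related tokens have already been extracted and resized to one resolution, with one 2D map per (time step, layer, head) combination. The function names, the `base` and `floor` parameters, the plain averaging, and the linear attenuation rule are all hypothetical. MedSign embeds the watermark through a fine-tuned LDM decoder; the additive `embed` function below is only a stand-in to make the effect of the strength map visible.

```python
from typing import List

Map2D = List[List[float]]


def aggregate_attention(maps: List[Map2D]) -> Map2D:
    """Average same-sized attention maps, then min-max normalise to [0, 1].

    Each entry of `maps` is assumed to be one (time step, layer, head) slice.
    """
    if not maps:
        raise ValueError("need at least one attention map")
    h, w = len(maps[0]), len(maps[0][0])
    mean = [[sum(m[i][j] for m in maps) / len(maps) for j in range(w)]
            for i in range(h)]
    lo = min(min(row) for row in mean)
    hi = max(max(row) for row in mean)
    span = hi - lo
    if span == 0:
        # A flat map carries no localization signal.
        return [[0.0] * w for _ in range(h)]
    return [[(v - lo) / span for v in row] for row in mean]


def strength_map(loc: Map2D, base: float = 1.0, floor: float = 0.1) -> Map2D:
    """Per-pixel strength: `base` where loc is 0, `base * floor` where loc is 1."""
    return [[base * (1.0 - (1.0 - floor) * v) for v in row] for row in loc]


def embed(image: Map2D, pattern: Map2D, strength: Map2D) -> Map2D:
    """Additive stand-in for embedding (assumption, not the paper's decoder)."""
    return [[px + s * p for px, p, s in zip(img_row, pat_row, str_row)]
            for img_row, pat_row, str_row in zip(image, pattern, strength)]


# Toy example: two 2x2 attention maps whose bottom-right pixel is the "lesion".
maps = [
    [[0.0, 0.2], [0.4, 0.8]],
    [[0.0, 0.2], [0.4, 1.2]],
]
loc = aggregate_attention(maps)
strength = strength_map(loc, base=1.0, floor=0.1)
marked = embed([[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], strength)
```

In this toy case the watermark is applied at full strength in the top-left pixel and at roughly one tenth of that in the bottom-right pixel, where the aggregated attention peaks. A non-zero `floor` keeps some watermark signal inside lesion regions. Whether MedSign does the same is not stated in the abstract.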