🤖 AI Summary
This study addresses the privacy risks posed by sensitive medical images, such as prenatal ultrasound scans that carry personally identifiable information like names and locations, within large-scale public image datasets such as LAION-400M. It presents the first systematic investigation into the presence of such high-risk content in general-purpose datasets used to train generative models. Combining CLIP embedding similarity search, image content analysis, and named entity recognition, the authors retrieve thousands of ultrasound images from LAION-400M and detect thousands of private-information entities, including names and locations. The findings show that sensitive medical data is widely included in training corpora, exposing significant privacy vulnerabilities. Based on this empirical evidence, the work proposes concrete recommendations for privacy-preserving dataset curation and usage, offering a foundation for responsible data governance in AI development.
📝 Abstract
The rise of generative models has driven increased use of large-scale datasets scraped from the internet, often with minimal or no curation. This raises concerns about the inclusion of sensitive or private information. In this work, we examine the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic search of the LAION-400M dataset using CLIP embedding similarity, we retrieve pregnancy ultrasound images and detect thousands of private-information entities, such as names and locations. Our findings reveal that many images contain high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and the ethical use of public image datasets.
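The retrieval step described above can be sketched as a cosine-similarity search over precomputed CLIP embeddings (LAION releases such embeddings alongside its URL/caption metadata). The sketch below is a minimal illustration only: the function name, the similarity threshold, and the synthetic embeddings are assumptions for demonstration, not the paper's actual pipeline or parameters.

```python
import numpy as np

def retrieve_by_similarity(query_emb, index_embs, top_k=5, threshold=0.3):
    """Return (index, cosine similarity) pairs for the top_k entries
    whose similarity to the query embedding exceeds the threshold."""
    q = query_emb / np.linalg.norm(query_emb)
    index_norm = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = index_norm @ q                      # cosine similarities to the query
    order = np.argsort(-sims)[:top_k]          # highest-similarity first
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]

# Synthetic stand-ins for real embeddings: dimension 512 matches CLIP
# ViT-B/32. The vector planted at index 42 simulates an image whose
# embedding lies close to the text query (e.g. "pregnancy ultrasound").
rng = np.random.default_rng(0)
dim = 512
query = rng.normal(size=dim)                    # stand-in for a CLIP text embedding
index = rng.normal(size=(1000, dim))            # stand-in for image embeddings
index[42] = query + 0.1 * rng.normal(size=dim)  # a near-duplicate "match"

results = retrieve_by_similarity(query, index)
print(results[0])  # the planted index 42, with similarity close to 1.0
```

In a real pipeline the query embedding would come from a CLIP text encoder and the index from the dataset's precomputed image embeddings; candidate matches would then pass to image content analysis and named entity recognition over the captions and any text visible in the images.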