Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability

📅 2025-04-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes systemic ethical risks—including privacy violations, lack of informed consent, and absence of accountability—arising from the pervasive presence of children's images in AI training datasets. To address the challenge of detecting and filtering such images in existing large-scale datasets (e.g., Open Images V7), the authors propose an open-source, reproducible, ethics-first detection-and-removal pipeline. The approach integrates fine-tuned vision-language models, facial analysis, age estimation, and context-aware semantic filtering to identify child-associated content. Evaluated on the #PraCegoVer benchmark and a 100,000-image Open Images subset, it achieves high recall in identifying child-related images. Empirical validation demonstrates its practical utility for data purification in visual question answering (VQA) tasks. Beyond its technical contribution, the work advances normative frameworks for child-data governance and calls for cross-institutional ethical action.
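The summary above describes a staged design: cheaper facial analysis and age estimation first, then a context-aware VQA check, followed by removal of flagged images. The paper does not publish this exact interface; the following is only a minimal structural sketch under that assumption, with stub callables standing in for the real face-analysis, age-estimation, and vision-language components:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Detection:
    image_id: str
    flagged: bool
    reason: str

def run_pipeline(
    image_ids: Iterable[str],
    face_age_stage: Callable[[str], bool],
    vqa_stage: Callable[[str], bool],
) -> List[Detection]:
    """Flag an image if any stage reports child-associated content.

    Stages run in order: facial/age analysis first, then the
    context-aware VQA check for images the first stage did not flag.
    """
    results = []
    for image_id in image_ids:
        if face_age_stage(image_id):
            results.append(Detection(image_id, True, "face/age"))
        elif vqa_stage(image_id):
            results.append(Detection(image_id, True, "vqa-context"))
        else:
            results.append(Detection(image_id, False, ""))
    return results

def purge(image_ids: Iterable[str], detections: List[Detection]) -> List[str]:
    """Return the dataset with all flagged images removed."""
    flagged = {d.image_id for d in detections if d.flagged}
    return [i for i in image_ids if i not in flagged]

# Hypothetical stub stages standing in for the real models:
face_age_stage = lambda img: img == "img_02"   # e.g. a detected face with low estimated age
vqa_stage = lambda img: img == "img_03"        # e.g. a VLM answering "yes" to a child-presence question

dataset = ["img_01", "img_02", "img_03"]
detections = run_pipeline(dataset, face_age_stage, vqa_stage)
clean = purge(dataset, detections)
print(clean)  # -> ['img_01']
```

Keeping the per-image `reason` is a deliberate choice: it makes removals auditable, which matters for the accountability goal the paper emphasizes.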

📝 Abstract
Including children's images in datasets has raised ethical concerns, particularly regarding privacy, consent, data protection, and accountability. These datasets, often built by scraping publicly available images from the Internet, can expose children to risks such as exploitation, profiling, and tracking. Despite the growing recognition of these issues, approaches for addressing them remain limited. We explore the ethical implications of using children's images in AI datasets and propose a pipeline to detect and remove such images. As a use case, we built the pipeline on a Vision-Language Model under the Visual Question Answering task and tested it on the #PraCegoVer dataset. We also evaluate the pipeline on a subset of 100,000 images from the Open Images V7 dataset to assess its effectiveness in detecting and removing images of children. The pipeline serves as a baseline for future research, providing a starting point for more comprehensive tools and methodologies. While we leverage existing models trained on potentially problematic data, our goal is to expose and address this issue. We do not advocate for training or deploying such models, but instead call for urgent community reflection and action to protect children's rights. Ultimately, we aim to encourage the research community to exercise more than ordinary care in creating new datasets and to inspire the development of tools to protect the fundamental rights of vulnerable groups, particularly children.
Problem

Research questions and friction points this paper is trying to address.

Ethical concerns of children's images in AI datasets
Lack of methods to detect and remove children's images
Need for tools to protect children's privacy rights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Propose a pipeline to detect and remove children's images
Build the pipeline on a Vision-Language Model for VQA and test it on #PraCegoVer
Evaluate effectiveness on a 100,000-image Open Images V7 subset