🤖 AI Summary
Concept filtering is widely deployed to prevent text-to-image (T2I) models from generating child sexual abuse material (CSAM), yet its effectiveness has not been rigorously evaluated. Method: We propose a game-based security definition that captures the complexity of preventing CSAM generation, and we use "a child wearing glasses" (CWG) as an ethically grounded proxy concept to quantify filtering vulnerabilities. Through systematic prompt engineering, fine-tuning, and child-filtering data curation, we assess filtering robustness across both closed- and open-weight T2I models. Contribution/Results: We show that existing concept filters suffer from fundamental limitations: they fail to eliminate child-related generations entirely, offer negligible protection for open-weight models, and reduce model generality. Critically, even under perfect filtering, CSAM-associated concepts can be reintroduced via lightweight fine-tuning. This work provides empirical evidence of the structural weaknesses of concept filtering for CSAM mitigation and establishes a reproducible evaluation paradigm for AI content safety.
📝 Abstract
We evaluate the effectiveness of child filtering in preventing the misuse of text-to-image (T2I) models to create child sexual abuse material (CSAM). First, we capture the complexity of preventing CSAM generation using a game-based security definition. Second, we show that current detection methods cannot remove all images of children from a dataset. Third, using an ethical proxy for CSAM (a child wearing glasses, hereafter CWG), we show that even when only a small percentage of child images remain in the training dataset, there exist prompting strategies that generate CWG from a child-filtered T2I model using only a few more queries than when the model is trained on the unfiltered data. Fine-tuning the filtered model on child images further reduces the additional query overhead. We also show that reintroducing a concept is possible via fine-tuning even if filtering is perfect. Our results demonstrate that current filtering methods offer limited protection to closed-weight models and no protection to open-weight models, while reducing the generality of the model by hindering the generation of child-related concepts or changing their representation. We conclude by outlining challenges in conducting evaluations that establish robust evidence on the impact of AI safety mitigations for CSAM.
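To make the "game-based security definition" mentioned above more concrete, the sketch below illustrates one plausible shape such a definition could take, in the style of cryptographic security games. The notation (model M, concept filter F_C, proxy concept C, query budget q, concept oracle O_C, adversary 𝒜) is assumed for illustration only and is not taken from the paper.

```latex
% Illustrative sketch of a concept-generation security game (assumed notation;
% the paper's exact formalization may differ).
%
% Challenger: trains model M on dataset D after applying a concept filter F_C
%             intended to remove instances of the target concept C.
% Adversary 𝒜: issues up to q text prompts p_1, ..., p_q to M (and, in the
%             open-weight setting, may also fine-tune M directly).
% Win condition: 𝒜 wins if some output image M(p_i) is judged by a concept
%             oracle O_C to depict C.
\[
  \mathrm{Adv}^{\mathrm{concept}}_{M, F_C}(\mathcal{A}, q)
    \;=\;
  \Pr\!\left[\,\exists\, i \le q \;:\; O_C\big(M(p_i)\big) = 1\,\right]
\]
% Filtering would be considered effective only if this advantage stays small
% for realistic query budgets q.
```

Under a definition of this shape, the abstract's findings correspond to the adversary's advantage remaining high, or the required query budget q growing only slightly, even after filtering, and to fine-tuning restoring the concept even when filtering is perfect.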