Abstract
We introduce ShieldGemma 2, a 4B-parameter image content moderation model built on Gemma 3. The model provides robust safety risk predictions across three key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content, for both synthetic images (e.g., the output of an image generation model) and natural images (e.g., any image input to a vision-language model). We evaluate it on both internal and external benchmarks, demonstrating state-of-the-art performance under our policies compared to LlavaGuard (Helff et al., 2024), GPT-4o mini (Hurst et al., 2024), and the base Gemma 3 model (Gemma Team, 2025). Additionally, we present a novel adversarial data generation pipeline that enables controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.
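The abstract describes a model whose interface maps an image to per-policy safety risk predictions over the three harm categories. A minimal sketch of that interface in Python, with a stub in place of the actual 4B model (the function and class names here are hypothetical illustrations, not the released API):

```python
from dataclasses import dataclass
from typing import Dict

# The three harm policies named in the abstract.
HARM_POLICIES = ("sexually_explicit", "violence_gore", "dangerous_content")

@dataclass
class ModerationResult:
    # Per-policy probability that the image violates that policy.
    scores: Dict[str, float]

    def flagged(self, threshold: float = 0.5) -> Dict[str, bool]:
        # Binarize each policy score against a deployment-chosen threshold.
        return {policy: score >= threshold for policy, score in self.scores.items()}

def moderate_image(image_bytes: bytes) -> ModerationResult:
    """Hypothetical wrapper: a real system would run the vision-language
    model on the image here. This stub returns fixed benign scores purely
    to illustrate the input/output contract."""
    scores = {policy: 0.01 for policy in HARM_POLICIES}
    return ModerationResult(scores)

result = moderate_image(b"...")  # placeholder image bytes
print(result.flagged())
```

In practice the same wrapper could be applied both to generator outputs (synthetic images) and to user-supplied images before they reach a vision-language model, matching the two use cases the abstract lists.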