ShieldGemma 2: Robust and Tractable Image Content Moderation

📅 2025-04-01
🤖 AI Summary
This work addresses robustness and scalability challenges in moderating both synthetic and natural images. We propose the first dedicated multimodal moderation model for synthetic imagery, built on the 4B-parameter Gemma 3 vision-language architecture. Our method introduces an adversarial data generation pipeline enabling controllable, diverse, and semantically consistent synthesis of harmful images; combines strategy-driven fine-tuning, adversarial sample augmentation, and multi-source harm classification via joint training; and achieves fine-grained detection across three critical harm categories: sexually explicit content, violence/gore, and dangerous content. Evaluated on both internal and external benchmarks, the model surpasses state-of-the-art baselines, including LlavaGuard, GPT-4o mini, and the base Gemma 3 model, establishing new state-of-the-art performance in synthetic image moderation. Notably, we release the first open-source, synthetic-image-specific moderation model, advancing practical multimodal AI safety governance.

📝 Abstract
We introduce ShieldGemma 2, a 4B-parameter image content moderation model built on Gemma 3. The model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content, for synthetic images (e.g., the output of any image generation model) and natural images (e.g., any image input to a vision-language model). We evaluate on both internal and external benchmarks, demonstrating state-of-the-art performance against LlavaGuard (Helff et al., 2024), GPT-4o mini (Hurst et al., 2024), and the base Gemma 3 model (Gemma Team, 2025) under our policies. Additionally, we present a novel adversarial data generation pipeline that enables controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.
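The abstract describes per-category safety risk predictions over three harm policies. As a minimal sketch of how such per-category risk scores might be turned into moderation decisions downstream, assuming illustrative scores and a hypothetical 0.5 threshold (the real model's scoring interface and operating points are not specified here):

```python
# Sketch: mapping per-category harm probabilities to a moderation
# decision for the three ShieldGemma 2 policy categories.
# The score values and the 0.5 threshold are illustrative assumptions,
# not outputs or defaults of the actual model.

POLICIES = ["Sexually Explicit", "Violence & Gore", "Dangerous Content"]

def moderate(scores: dict[str, float], threshold: float = 0.5) -> dict:
    """Flag every policy whose risk score meets or exceeds the threshold."""
    violations = [p for p in POLICIES if scores.get(p, 0.0) >= threshold]
    return {"flagged": bool(violations), "violations": violations}

# Example with hypothetical scores a deployment might obtain per image:
decision = moderate({"Sexually Explicit": 0.02,
                     "Violence & Gore": 0.91,
                     "Dangerous Content": 0.10})
print(decision)  # {'flagged': True, 'violations': ['Violence & Gore']}
```

In practice the threshold per category would be tuned against policy-specific precision/recall targets rather than shared across all three.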
Problem

Research questions and friction points this paper is trying to address.

Develops a robust image content moderation model
Detects harmful content in both synthetic and natural images
Advances multimodal safety and responsible AI development
Innovation

Methods, ideas, or system contributions that make the work stand out.

4B-parameter image moderation model built on Gemma 3
Novel adversarial data generation pipeline for controlled, diverse training data
Open image moderation tool for multimodal safety