ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

📅 2025-11-24

🤖 AI Summary

Current text-and-image-to-video (TI2V) generation models face compound multimodal safety risks: harmful content may originate from either modality alone or from the interaction between them. Mainstream safety approaches are largely text-centric, rely on predefined risk categories, or perform only post-hoc filtering, and so fail to proactively suppress risks during generation. This paper introduces ConceptGuard, the first end-to-end proactive safety framework tailored for TI2V generation. Its core components are: (1) a multimodal concept-space risk detection module leveraging CLIP and contrastive learning; and (2) a semantic suppression mechanism embedded within the generator, enabling fine-grained risk mitigation in the latent space. Evaluated on two newly introduced benchmarks, ConceptRisk and T2VSafetyBench-TI2V, ConceptGuard achieves state-of-the-art performance in both risk detection accuracy and safe video generation quality, consistently outperforming existing baselines.

📝 Abstract

Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: first, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt's multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.
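
The first stage described in the abstract, scoring a fused image-text input against a set of risk concepts, can be illustrated with a toy sketch. This is not the paper's implementation: the weighted-average fusion, the hand-written 3-dimensional vectors, the threshold value, and the names `fuse` and `detect_risks` are all assumptions made for illustration. A real system would presumably use learned encoders and a trained projection.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(text_emb, image_emb, alpha=0.5):
    """Hypothetical fusion: a weighted average of the two embeddings."""
    return [alpha * t + (1 - alpha) * i for t, i in zip(text_emb, image_emb)]


def detect_risks(text_emb, image_emb, concepts, threshold=0.9):
    """Return the concepts whose similarity to the fused input meets the threshold."""
    fused = fuse(text_emb, image_emb)
    scores = {name: cosine(fused, vec) for name, vec in concepts.items()}
    return {name: score for name, score in scores.items() if score >= threshold}


# Toy vectors (assumed): the risk concept lies "between" the text and image directions.
text_emb = [1.0, 0.0, 0.0]
image_emb = [0.0, 1.0, 0.0]
concepts = {"toy_risk": [1.0, 1.0, 0.0]}

flagged = detect_risks(text_emb, image_emb, concepts)
```

In this toy setup each modality alone scores about 0.71 against `toy_risk`, below the 0.9 threshold, while the fused input scores about 1.0 and is flagged. That mirrors the compositional case the abstract highlights, where the risk emerges only from the interaction of text and image.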

Problem

Research questions and friction points this paper is trying to address.

Detects safety risks in multimodal video generation inputs

Proactively mitigates unsafe semantics before video generation

Addresses compositional risks from text-image interaction in prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive multimodal risk detection in video generation

Contrastive detection in structured concept space

Semantic suppression for safe multimodal conditioning
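
One generic way to picture semantic suppression on a conditioning vector is to remove the component that points along a detected unsafe concept direction. The sketch below shows that projection idea only; it is an assumption chosen for illustration and not necessarily how ConceptGuard intervenes in the generator. The function name `suppress` and the `strength` parameter are hypothetical.

```python
def suppress(cond, concept, strength=1.0):
    """Subtract the projection of `cond` onto `concept`, scaled by `strength`.

    strength=1.0 removes the concept component entirely;
    strength=0.0 leaves the conditioning vector unchanged.
    """
    norm_sq = sum(c * c for c in concept)
    if norm_sq == 0:
        # A zero concept vector defines no direction, so nothing is removed.
        return list(cond)
    coeff = sum(a * b for a, b in zip(cond, concept)) / norm_sq
    return [a - strength * coeff * b for a, b in zip(cond, concept)]


# Toy example (assumed values): the first axis plays the role of the unsafe concept.
cond = [1.0, 2.0]
unsafe = [1.0, 0.0]

fully_suppressed = suppress(cond, unsafe)        # concept component removed
half_suppressed = suppress(cond, unsafe, 0.5)    # concept component halved
```

With full strength the result is orthogonal to the concept direction while the remaining component is untouched, which is the sense in which such an edit steers the conditioning away from a concept without discarding the rest of the prompt.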
