ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of detecting multimodal implicit toxicity, i.e., harmful meanings that arise only when individually benign images and text are interpreted jointly, this work introduces the first systematic, fine-grained taxonomy (7 risk categories, 31 sub-categories) and the first dedicated benchmark dataset, MMIT, comprising 2,100 samples spanning 5 typical cross-modal correlation modes. The authors propose ShieldVLM, a vision-language model that combines cross-modal alignment, deliberative reasoning, and risk source tracing (causal attribution) to detect implicit toxicity at the statement, prompt, and dialogue levels. Experiments show that ShieldVLM outperforms existing strong baselines on both implicit and explicit toxicity detection. The model and the MMIT dataset will be released publicly, providing foundational resources for multimodal content safety research.
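The summary names the pipeline's stages (cross-modal alignment, deliberative reasoning, risk source tracing), but this listing contains no code. As a purely illustrative sketch, the detection step can be framed as a staged deliberative prompt over any LVLM: describe each modality, reason about what the combination implies, then emit a verdict with an attribution. Here `lvlm_generate` is a hypothetical stand-in for an arbitrary LVLM inference call, and the prompt wording is our own, not ShieldVLM's.

```python
# Illustrative sketch of deliberative cross-modal toxicity detection.
# NOTE: `lvlm_generate(image, prompt) -> str` is a hypothetical stand-in for
# any LVLM inference call; this is NOT ShieldVLM's released implementation.

DELIBERATION_PROMPT = """You are a content-safety reviewer.
Step 1: Describe the literal content of the image.
Step 2: Restate the accompanying text: "{text}"
Step 3: Reason about what the image and text imply ONLY when combined,
even if each modality looks benign on its own.
Step 4: Output one line "VERDICT: SAFE" or "VERDICT: UNSAFE", followed by
a short risk attribution naming the harmful cross-modal cue, if any."""

def detect_implicit_toxicity(image, text, lvlm_generate):
    """Run staged deliberation on an (image, text) pair; return (is_unsafe, rationale)."""
    rationale = lvlm_generate(image, DELIBERATION_PROMPT.format(text=text))
    for line in rationale.splitlines():
        stripped = line.strip().upper()
        if stripped.startswith("VERDICT:"):
            verdict = stripped.split(":", 1)[1].strip()
            return verdict.startswith("UNSAFE"), rationale
    # Conservative fallback: no parseable verdict means escalate to human review.
    return True, rationale
```

The conservative fallback (flagging any response without a parseable verdict) is a design choice of this sketch, not something the paper specifies.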

📝 Abstract
Toxicity detection in multimodal text-image content faces growing challenges, especially with multimodal implicit toxicity, where each modality appears benign on its own but conveys harm when the two are combined. Multimodal implicit toxicity appears not only in statements posted on social platforms but also in prompts that can elicit toxic dialogs from Large Vision-Language Models (LVLMs). Despite successes in unimodal text and image moderation, toxicity detection for multimodal content, particularly multimodal implicit toxicity, remains underexplored. To fill this gap, we build a comprehensive taxonomy for multimodal implicit toxicity (MMIT) and introduce the MMIT-dataset, comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model that identifies implicit toxicity in multimodal statements, prompts, and dialogs via deliberative cross-modal reasoning. Experiments show that ShieldVLM outperforms existing strong baselines in detecting both implicit and explicit toxicity. The model and dataset will be publicly available to support future research. Warning: This paper contains potentially sensitive content.
Problem

Research questions and friction points this paper is trying to address.

Detecting implicit toxicity in multimodal text-image content
Addressing the underexplored case where individually benign modalities become toxic in combination
Improving detection of prompts that elicit toxic dialogs from LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive taxonomy for multimodal implicit toxicity
Introduces the MMIT-dataset of 2,100 multimodal statements and prompts
ShieldVLM detects implicit toxicity via deliberative cross-modal reasoning
Authors

Shiyao Cui, Tsinghua University
Qinglin Zhang, The Conversational AI (CoAI) group, DCST, Tsinghua University, China
Ouyang Xuan, The Conversational AI (CoAI) group, DCST, Tsinghua University, China
Renmiao Chen, The Conversational AI (CoAI) group, DCST, Tsinghua University, China
Zhexin Zhang, Tsinghua University, CoAI Group (NLP, AI Safety & Alignment)
Yida Lu, Tsinghua University, CoAI Group (NLP, AI Safety & Alignment)
Hongning Wang, Associate Professor, Department of Computer Science and Technology, Tsinghua University (Machine Learning, Information Retrieval, Large Language Models)
Han Qiu, NTU
Minlie Huang, The Conversational AI (CoAI) group, DCST, Tsinghua University, China