Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

📅 2025-05-21

🤖 AI Summary

This study investigates the safety risks that vision-language models (VLMs) face when processing real-world internet meme images, a threat surface that prior evaluations, which mostly rely on artificial images, have largely overlooked. Method: We introduce MemeSafetyBench, a large-scale, ecologically valid meme safety benchmark comprising 50,430 instances that pair authentic meme images with both harmful and benign instructions. Our methodology integrates LLM-driven instruction generation, a comprehensive safety taxonomy, and comparative analysis across model scales and across single-turn and multi-turn interactions. Contribution/Results: VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images, and memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Multi-turn dialogue mitigates this risk only partially. The work systematically characterizes memes as a distinct, high-risk input for VLM safety and argues for evaluation grounded in realistic usage contexts.

📝 Abstract

Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs show greater vulnerability to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.
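The comparison the abstract describes (harmful responses and refusals under meme inputs versus text-only inputs) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the record fields, the condition labels "text_only" and "meme", and the toy data are all assumptions made up for illustration, and the harmful/refused flags are assumed to come from some external judge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedResponse:
    """One model response after judging (hypothetical schema, not from the paper)."""
    condition: str  # assumed label for the input type, e.g. "text_only" or "meme"
    harmful: bool   # judge decided the response is harmful
    refused: bool   # judge decided the model refused the instruction


def rates_by_condition(records):
    """Return the harmful-response rate and refusal rate for each input condition."""
    counts = defaultdict(lambda: {"n": 0, "harmful": 0, "refused": 0})
    for r in records:
        c = counts[r.condition]
        c["n"] += 1
        c["harmful"] += int(r.harmful)
        c["refused"] += int(r.refused)
    return {
        cond: {
            "harmful_rate": c["harmful"] / c["n"],
            "refusal_rate": c["refused"] / c["n"],
        }
        for cond, c in counts.items()
    }


# Toy data, invented purely to exercise the function; not results from the paper.
toy_records = [
    JudgedResponse("text_only", harmful=False, refused=True),
    JudgedResponse("text_only", harmful=False, refused=True),
    JudgedResponse("text_only", harmful=False, refused=True),
    JudgedResponse("text_only", harmful=True, refused=False),
    JudgedResponse("meme", harmful=True, refused=False),
    JudgedResponse("meme", harmful=True, refused=False),
    JudgedResponse("meme", harmful=True, refused=False),
    JudgedResponse("meme", harmful=False, refused=True),
]

toy_rates = rates_by_condition(toy_records)
```

Keeping the two rates separate reflects the abstract's framing, which reports increased harmful responses and decreased refusals as distinct effects rather than a single score.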

Problem

Research questions and friction points this paper is trying to address.

Assessing safety risks of VLMs with real meme images

Evaluating meme influence on harmful VLM outputs

Investigating mitigation effects in multi-turn interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MemeSafetyBench for real meme evaluation

Assesses VLMs with harmful and benign instructions

Investigates meme impact on model safety metrics

🔎 Similar Papers

Exploring the Limits of Zero Shot Vision Language Models for Hate Meme Detection: The Vulnerabilities and their Interpretations