MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods rely predominantly on binary classification and fail to capture fine-grained multimodal harm understanding by multimodal large language models (mLLMs) in complex contexts such as social media memes. To address this, we propose an agent-based, arena-style evaluation framework that generates contextualized tasks and employs a multi-perspective consensus mechanism to simulate diverse interpretive scenarios, enabling context-aware, unbiased assessment of harmful content understanding. The approach combines adversarial agent architectures, multi-perspective ensemble judgment, and human preference alignment. Experiments demonstrate that the framework substantially reduces judge bias and agrees closely with human judgments (Spearman's ρ > 0.92). It outperforms baseline methods in granularity, contextual adaptability, and fairness, establishing a scalable, interpretable paradigm for multimodal content safety evaluation.
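
To make the reported agreement check concrete, below is a minimal Python sketch of comparing an automatic judge's per-model scores against human preference scores using Spearman's ρ, the statistic cited above. The model names and scores are hypothetical placeholders, not the paper's data.

```python
# Sketch: checking an automatic judge's agreement with human preferences
# via Spearman's rank correlation, the statistic the paper reports
# (rho > 0.92). Model names and scores are hypothetical placeholders.
from scipy.stats import spearmanr

# Aggregate harmfulness-understanding scores per model from the arena judge.
arena_scores = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62, "model_d": 0.55}
# Scores derived from human preference annotations on the same tasks.
human_scores = {"model_a": 0.85, "model_b": 0.70, "model_c": 0.66, "model_d": 0.52}

models = sorted(arena_scores)  # fixed ordering so the two score lists align
rho, p_value = spearmanr(
    [arena_scores[m] for m in models],
    [human_scores[m] for m in models],
)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
```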

📝 Abstract
The proliferation of memes on social media requires multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy on binary classification tasks, which often fails to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment of mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligned with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluation for multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.
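
To illustrate the task-formulation step the abstract describes (simulating interpretive contexts to elicit perspective-specific analyses), here is a minimal sketch of how such prompts might be generated. The `EvalTask` structure, perspective list, and prompt template are illustrative assumptions, not MemeArena's actual pipeline.

```python
# Sketch: formulating perspective-specific evaluation tasks for one meme.
# The perspectives, prompt template, and EvalTask structure are illustrative
# assumptions, not MemeArena's actual task-generation pipeline.
from dataclasses import dataclass

@dataclass
class EvalTask:
    perspective: str
    prompt: str

PERSPECTIVES = [
    "a content moderator enforcing platform safety policy",
    "a member of the community the meme targets",
    "a casual reader unfamiliar with the meme's cultural references",
]

def build_tasks(image_description: str, overlaid_text: str) -> list[EvalTask]:
    """Create one contextualized task per interpretive perspective."""
    tasks = []
    for p in PERSPECTIVES:
        prompt = (
            f"You are {p}. The meme shows {image_description!r} "
            f"with the overlaid text {overlaid_text!r}. "
            "Explain whether this meme is harmful in your context, and why."
        )
        tasks.append(EvalTask(perspective=p, prompt=prompt))
    return tasks

# Example usage with a made-up meme description.
for task in build_tasks("a politician edited to look like a clown", "born to rule the circus"):
    print(f"[{task.perspective}] {task.prompt[:70]}...")
```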
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal harmfulness understanding in mLLMs
Addressing bias in context-aware harmfulness interpretation
Automating unbiased assessment of nuanced multimodal harmfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-based framework for unbiased multimodal harmfulness evaluation
Simulates diverse contexts to elicit perspective-specific model analyses
Integrates varied viewpoints to reach consensus among evaluators (a sketch of one possible consensus rule follows below)
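
As referenced in the last item above, here is a minimal sketch of one plausible consensus rule over per-perspective judge verdicts: majority vote with abstention on ties. The `consensus` helper and the verdict labels are hypothetical stand-ins, not the paper's actual aggregation mechanism.

```python
# Sketch: reaching consensus among judge agents on a pairwise comparison.
# Judge verdicts are hard-coded stand-ins for real agent outputs; the
# consensus rule (majority with abstention on ties) is an assumption.
from collections import Counter

def consensus(verdicts: list[str]) -> str:
    """Aggregate per-perspective verdicts ('A', 'B', or 'tie') by majority."""
    counts = Counter(verdicts)
    (top, top_n), *rest = counts.most_common()
    if rest and rest[0][1] == top_n:  # no clear majority across perspectives
        return "tie"
    return top

# One verdict per judging perspective for a single model-A-vs-model-B task.
print(consensus(["A", "A", "B"]))    # -> "A"
print(consensus(["A", "B", "tie"]))  # -> "tie" (three-way split)
```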