🤖 AI Summary
Existing AI-generated image detection methods focus primarily on binary authenticity classification, neglecting the task of inferring the generative intent behind synthetic content. Method: This paper introduces an intent-aware detection task for real-world social media scenarios and presents S-HArM, a multimodal dataset of 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled with three intent categories: Humor/Satire, Art, and Misinformation. The authors explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to synthesize a large-scale training dataset with Stable Diffusion, and conduct a comparative study spanning modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Contribution/Results: Models trained on image- and multimodally-guided synthetic data generalize better to "in the wild" content, since visual context is preserved; overall performance nonetheless remains limited, underscoring the difficulty of intent inference and the need for specialized architectures. The work establishes a new benchmark for intent-aware multimodal content understanding.
📝 Abstract
Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.
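The three prompting strategies can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name, prompt templates, and the assumption that an image-captioning model supplies the image description are all hypothetical.

```python
# Illustrative sketch of the three prompting strategies used to build
# synthetic training data with Stable Diffusion. Templates and names
# are assumptions; the paper's actual prompts may differ.

def build_prompt(strategy, post_text=None, image_description=None):
    """Return a text prompt for Stable Diffusion under one of the
    three guidance strategies."""
    if strategy == "image-guided":
        # Condition only on visual content, e.g. a caption produced
        # by an image-captioning model run on the original image.
        return image_description
    if strategy == "description-guided":
        # Condition only on the post's accompanying text.
        return post_text
    if strategy == "multimodally-guided":
        # Combine both signals, so generated images retain the
        # original visual context alongside the textual framing.
        return f"{image_description}, {post_text}"
    raise ValueError(f"unknown strategy: {strategy}")
```

Feeding the returned prompt to a text-to-image pipeline (e.g. a Stable Diffusion checkpoint) would then yield one synthetic training image per source pair; the finding that image- and multimodally-guided data generalize better suggests the image description term carries most of the transferable signal.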