🤖 AI Summary
Existing deepfake benchmarks primarily focus on identity swapping or localized editing and lack large-scale datasets that support human-centered, high-level semantic manipulation (such as actions, scenes, and human-object interactions) with explicit reasoning capabilities. To address this gap, we propose MultiFakeVerse, the first large-scale deepfake benchmark dedicated to human-centered semantic manipulation, comprising 845,286 images generated via vision-language models (VLMs). Our approach introduces a VLM-instruction-guided, semantics-driven generation paradigm that enables context-aware manipulations grounded in narrative intent and perceptual importance, integrating controllable image synthesis with multimodal semantic alignment. Extensive experiments demonstrate that state-of-the-art detectors and human observers struggle to identify these concept-level manipulations, confirming their strong imperceptibility. The dataset and code are publicly released.
📝 Abstract
The rapid advancement of GenAI technology over the past few years has significantly contributed to highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale, reasoning-driven deepfake benchmark dataset specifically tailored for person-centric object, context, and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large-scale person-centric deepfake dataset comprising 845,286 images, with both the manipulation suggestions and the image manipulations derived from vision-language models (VLMs). The VLM instructions specifically target modifications to individuals or to contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations, such as modifying actions, scenes, and human-object interactions, rather than the low-level identity swaps and region-specific edits common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available on [GitHub](https://github.com/Parul-Gupta/MultiFakeVerse).