🤖 AI Summary
This work addresses the limitations of existing image manipulation detection benchmarks, which rely heavily on object masks and consequently struggle to capture subtle manipulations outside annotated regions while often misclassifying unedited areas as tampered. To overcome these issues, the authors propose the first unified framework for manipulation understanding that jointly models pixel-level tamper maps, semantic categories, and natural language descriptions. By integrating editing primitives with a semantic taxonomy, introducing quantified localization confidence, and establishing a multidimensional evaluation protocol, the approach moves beyond the conventional mask-based paradigm and significantly improves detection of both micro-edits and out-of-mask manipulations. The study further exposes evaluation biases inherent in current mask-centric metrics and establishes more rigorous standards for localization, classification, and descriptive fidelity. Code and data are publicly released.
📝 Abstract
Existing tampering detection benchmarks largely rely on object masks, which misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate image tampering analysis for vision-language models (VLMs) from coarse region labeling into a pixel-grounded, semantics- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace, remove, splice, inpaint, attribute change, colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification under a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level localization correctness, assess prediction confidence against true edit intensity, and measure tamper understanding via semantics-aware classification and natural language descriptions of the predicted regions. We also re-evaluate recent strong segmentation/localization-based tamper detectors and reveal substantial over- and under-scoring under mask-only metrics, exposing failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings, and language, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.