🤖 AI Summary
Existing forensic methods for AI-edited images are largely limited to binary authenticity classification or localization of manipulated regions; they do not identify the semantic type of an edit or provide interpretable reasoning. To address this gap, this work introduces EditSleuth, a dataset of 257,725 image-edit triplets. Each example pairs a source image and an edited image with a binary edit mask, and is annotated with a 12-class edit-type label, a three-component difficulty score (chosen because a four-component formulation collapses under correlated magnitude features), and a six-step reasoning chain generated deterministically from computable visual evidence. Using these automatically constructed triplets and rule-generated chains, the authors fine-tune Qwen2-VL-2B with LoRA under chain-as-target supervision. The resulting model matches a label-only baseline on classification accuracy among parseable answers, while also producing textual explanations of the edit that label-only supervision cannot provide. The dataset, the construction pipeline, and pilot training scripts are publicly released.
📝 Abstract
Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.
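The per-example contents listed in the abstract can be pictured as a simple record. The sketch below is illustrative only: the class name `EditRecord`, the field names, and the storage choices (image paths as strings, the mask as nested lists of 0/1) are assumptions made for this example and are not taken from the released dataset. Only the 12-class label space and the six-step chain length come from the abstract.

```python
from dataclasses import dataclass
from typing import List

NUM_EDIT_CLASSES = 12  # 12-class edit taxonomy, as stated in the abstract
CHAIN_STEPS = 6        # six-step reasoning chain, as stated in the abstract


@dataclass(frozen=True)
class EditRecord:
    """Hypothetical in-memory view of one example; all field names are assumed."""
    source_image: str           # path to the source image (assumed storage form)
    edited_image: str           # path to the edited image (assumed storage form)
    edit_mask: List[List[int]]  # binary mask: 1 = edited pixel, 0 = untouched
    edit_label: int             # index into the 12-class taxonomy
    difficulty: float           # difficulty score; its range is not assumed here
    reasoning_chain: List[str]  # one string per reasoning step

    def validate(self) -> None:
        """Check only the constraints that the abstract actually states."""
        if not 0 <= self.edit_label < NUM_EDIT_CLASSES:
            raise ValueError(f"edit_label {self.edit_label} outside 12-class taxonomy")
        if len(self.reasoning_chain) != CHAIN_STEPS:
            raise ValueError("reasoning chain must have exactly six steps")
        if any(v not in (0, 1) for row in self.edit_mask for v in row):
            raise ValueError("edit mask must be binary")

    def edited_fraction(self) -> float:
        """Fraction of pixels marked as edited in the mask."""
        total = sum(len(row) for row in self.edit_mask)
        return sum(sum(row) for row in self.edit_mask) / total if total else 0.0
```

A loader built along these lines could call `validate()` on each record before training, so that malformed examples surface early rather than as unparseable model targets.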