DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor performance of existing document forgery detection methods in real-world, label-free scenarios and the absence of a standardized zero-shot evaluation benchmark. We introduce the first zero-shot benchmark specifically designed for document forgery detection, encompassing eight forgery types—including text tampering, receipt forgery, and identity document manipulation—and evaluate 14 state-of-the-art methods without fine-tuning or domain adaptation. Our analysis reveals that current approaches suffer from threshold calibration failure, with performance bottlenecks stemming from suboptimal thresholds rather than feature representations. Remarkably, adapting thresholds using only ten unlabeled images recovers 39–55% of the performance gap. Comprehensive evaluation via Pixel-AUC, Pixel-F1, and Oracle-F1 metrics demonstrates that document forgery detection remains far from solved in practical settings.
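The summary states that adapting thresholds with only ten unlabeled images recovers much of the performance gap. The paper's exact calibration procedure is not described here; a minimal sketch of one plausible label-free approach, assuming the threshold is set so the flagged-pixel fraction matches an assumed tamper rate (the benchmark reports tampered regions covering 0.27–4.17% of pixels), could look like this (`calibrate_threshold` and its parameters are illustrative, not from the paper):

```python
import numpy as np

def calibrate_threshold(score_maps, expected_tamper_rate=0.02):
    """Label-free threshold calibration sketch.

    Pools per-pixel forgery scores from a handful of unlabeled domain
    images and picks the threshold whose flagged-pixel fraction matches
    an assumed tamper rate, instead of the default tau = 0.5.
    """
    all_scores = np.concatenate([s.ravel() for s in score_maps])
    # Flag the top `expected_tamper_rate` fraction of pixel scores.
    return float(np.quantile(all_scores, 1.0 - expected_tamper_rate))
```

With ten score maps from the target domain, the returned threshold would replace the fixed tau = 0.5 at inference time; no labels or retraining are involved, which matches the deployment scenario the summary describes.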

📝 Abstract
We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation, a deliberate design choice that reflects the realistic deployment scenario in which practitioners lack labeled document training data. Our central finding is a pervasive calibration failure that is invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (≥0.76) yet near-zero Pixel-F1. This AUC–F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27–4.17% of pixels in document images, an order of magnitude less than in natural-image benchmarks, making the standard τ=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2–10× higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39–55% of the Oracle-F1 gap, demonstrating that threshold adaptation, not retraining, is the key missing step for practical deployment. Overall, no evaluated method works reliably out of the box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
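The abstract's central diagnostic contrasts fixed-threshold Pixel-F1 with Oracle-F1. A minimal sketch of both metrics, assuming Oracle-F1 means the best F1 achievable over a per-image threshold sweep (the paper's precise definition may differ), shows how a model with good score separation can still score zero F1 at τ=0.5:

```python
import numpy as np

def pixel_f1(pred_mask, gt_mask):
    """Pixel-level F1 between a binary prediction and ground truth."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def oracle_f1(scores, gt_mask, n_thresholds=256):
    """Best F1 over a threshold sweep; ground truth is used only
    to *select* the threshold, isolating calibration from representation."""
    taus = np.linspace(scores.min(), scores.max(), n_thresholds)
    return max(pixel_f1(scores >= t, gt_mask) for t in taus)
```

If tampered pixels score around 0.4 and background around 0.1, the scores separate the classes perfectly, yet `scores >= 0.5` flags nothing and Pixel-F1 collapses to zero while Oracle-F1 is perfect: exactly the AUC–F1 gap the abstract attributes to miscalibration rather than weak features.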
Problem

Research questions and friction points this paper is trying to address.

document forgery detection
zero-shot evaluation
threshold calibration
generative AI forgeries
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot benchmark
document forgery detection
calibration failure
threshold adaptation
Pixel-AUC vs Pixel-F1
🔎 Similar Papers
Zengqi Zhao
University of North Carolina at Chapel Hill
Weidi Xia
University of California, Irvine
Peter Wei
Washington University in St. Louis
Yan Zhang
Scam.ai
Yiyi Zhang
Cornell University
Computer Vision · Generative Models
Jane Mo
Duke University
Tiannan Zhang
University of California, Davis
Yuanqin Dai
Scam.ai
Zexi Chen
Zhejiang University
Robotics · Perception · SLAM · Computer Vision
Simiao Ren
Scam.ai