DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor performance of existing document forgery detection methods in real-world, label-free scenarios and the absence of a standardized zero-shot evaluation benchmark. We introduce the first zero-shot benchmark specifically designed for document forgery detection, encompassing eight forgery types—including text tampering, receipt forgery, and identity document manipulation—and evaluate 14 state-of-the-art methods without fine-tuning or domain adaptation. Our analysis reveals that current approaches suffer from threshold calibration failure, with performance bottlenecks stemming from suboptimal thresholds rather than feature representations. Remarkably, adapting thresholds using only ten unlabeled images recovers 39–55% of the performance gap. Comprehensive evaluation via Pixel-AUC, Pixel-F1, and Oracle-F1 metrics demonstrates that document forgery detection remains far from solved in practical settings.
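The summary states that adapting thresholds with only ten unlabeled images recovers much of the performance gap. The paper's exact calibration procedure is not described here; a minimal sketch of one plausible label-free approach, assuming the threshold is set so the flagged-pixel fraction matches an assumed tamper rate (the benchmark reports tampered regions covering 0.27–4.17% of pixels), could look like this (`calibrate_threshold` and its parameters are illustrative, not from the paper):

```python
import numpy as np

def calibrate_threshold(score_maps, expected_tamper_rate=0.02):
    """Label-free threshold calibration sketch.

    Pools per-pixel forgery scores from a handful of unlabeled domain
    images and picks the threshold whose flagged-pixel fraction matches
    an assumed tamper rate, instead of the default tau = 0.5.
    """
    all_scores = np.concatenate([s.ravel() for s in score_maps])
    # Flag the top `expected_tamper_rate` fraction of pixel scores.
    return float(np.quantile(all_scores, 1.0 - expected_tamper_rate))
```

With ten score maps from the target domain, the returned threshold would replace the fixed tau = 0.5 at inference time; no labels or retraining are involved, which matches the deployment scenario the summary describes.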

📝 Abstract
We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation, a deliberate design choice that reflects the realistic deployment scenario in which practitioners lack labeled document training data. Our central finding is a pervasive calibration failure that is invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (≥0.76) yet near-zero Pixel-F1. This AUC–F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27–4.17% of pixels in document images, an order of magnitude less than in natural-image benchmarks, making the standard τ=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2–10× higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39–55% of the Oracle-F1 gap, demonstrating that threshold adaptation, not retraining, is the key missing step for practical deployment. Overall, no evaluated method works reliably out of the box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
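The abstract's central diagnostic contrasts fixed-threshold Pixel-F1 with Oracle-F1. A minimal sketch of both metrics, assuming Oracle-F1 means the best F1 achievable over a per-image threshold sweep (the paper's precise definition may differ), shows how a model with good score separation can still score zero F1 at τ=0.5:

```python
import numpy as np

def pixel_f1(pred_mask, gt_mask):
    """Pixel-level F1 between a binary prediction and ground truth."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def oracle_f1(scores, gt_mask, n_thresholds=256):
    """Best F1 over a threshold sweep; ground truth is used only
    to *select* the threshold, isolating calibration from representation."""
    taus = np.linspace(scores.min(), scores.max(), n_thresholds)
    return max(pixel_f1(scores >= t, gt_mask) for t in taus)
```

If tampered pixels score around 0.4 and background around 0.1, the scores separate the classes perfectly, yet `scores >= 0.5` flags nothing and Pixel-F1 collapses to zero while Oracle-F1 is perfect: exactly the AUC–F1 gap the abstract attributes to miscalibration rather than weak features.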
Problem

Research questions and friction points this paper is trying to address.

document forgery detection
zero-shot evaluation
threshold calibration
generative AI forgeries
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot benchmark
document forgery detection
calibration failure
threshold adaptation
Pixel-AUC vs Pixel-F1
🔎 Similar Papers
Zengqi Zhao
University of North Carolina at Chapel Hill
Weidi Xia
University of California, Irvine
Peter Wei
Washington University in St. Louis
Yan Zhang
Scam.ai
Yiyi Zhang
Cornell University
Computer Vision · Generative Models
Jane Mo
Duke University
Tiannan Zhang
University of California, Davis
Yuanqin Dai
Scam.ai
Zexi Chen
Zhejiang University
Robotics · Perception · SLAM · Computer Vision
Simiao Ren
Scam.ai