🤖 AI Summary
This work addresses two problems in high-resolution multimedia forensics: the high computational cost of existing vision-language models, and the tendency of conventional token pruning methods to discard forgery traces because of their semantic bias. The authors propose a training-free token compression framework that, for the first time, approaches token selection from the perspective of forgery detection. By integrating a Birth-Death Optimal Transport model with a slack dummy node and a high-frequency spectral prior, the method dynamically scores and retains the tokens that capture physical discontinuities indicative of manipulation. With only 10% of tokens preserved, the approach achieves a 2.97× inference speedup and over 90% reduction in FLOPs, while maintaining state-of-the-art performance in both deepfake and AI-generated content (AIGC) detection.
📝 Abstract
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven: they retain salient objects while discarding background regions, where manipulation traces such as high-frequency anomalies and temporal jitter often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, which quantifies the physical discontinuities that indicate transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves a $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
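The abstract does not spell out the formulation, so the following is only a minimal sketch of how a transport-based novelty score and a high-frequency prior might be combined to rank tokens. Everything specific here is an assumption made for illustration rather than the authors' method: the cosine-distance cost, the entropic Sinkhorn solver, the fixed `dummy_cost`, the FFT `cutoff`, the mixing weight `alpha`, and all function names are hypothetical.

```python
import numpy as np


def sinkhorn_with_dummy(cost, dummy_cost=0.3, eps=0.1, n_iters=200):
    """Entropic OT between n previous and m current tokens, plus one dummy node per side."""
    n, m = cost.shape
    # Augmented cost: the extra row models births, the extra column models deaths.
    C = np.full((n + 1, m + 1), dummy_cost, dtype=float)
    C[:n, :m] = cost
    C[n, m] = 0.0  # dummy-to-dummy transport is free and absorbs unused slack
    # Real tokens carry unit mass; each dummy carries enough to cover every birth/death.
    a = np.concatenate([np.ones(n), [float(m)]])
    b = np.concatenate([np.ones(m), [float(n)]])
    K = np.exp(-C / eps)
    u, v = np.ones(n + 1), np.ones(m + 1)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def high_freq_ratio(patches, cutoff=0.25):
    """Fraction of each patch's spectral energy above a radial frequency cutoff."""
    p = patches.shape[-1]
    spec = np.abs(np.fft.fft2(patches, axes=(-2, -1))) ** 2
    f = np.fft.fftfreq(p)
    high = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2) > cutoff
    total = spec.sum(axis=(-2, -1)) + 1e-12
    return spec[:, high].sum(axis=-1) / total


def forensic_scores(prev_tokens, cur_tokens, patches, dummy_cost=0.3, alpha=0.5):
    """Assumed score: weighted sum of birth mass and high-frequency energy ratio."""
    pn = prev_tokens / (np.linalg.norm(prev_tokens, axis=1, keepdims=True) + 1e-12)
    cn = cur_tokens / (np.linalg.norm(cur_tokens, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - pn @ cn.T  # cosine distance between frames
    plan = sinkhorn_with_dummy(cost, dummy_cost=dummy_cost)
    birth = plan[-1, :-1]  # mass each current token receives from the dummy row
    return alpha * birth + (1.0 - alpha) * high_freq_ratio(patches)


def keep_top_tokens(scores, keep_ratio=0.10):
    """Indices of the highest-scoring tokens, returned in their original order."""
    k = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[-k:])


# Toy example: 8 tokens; token 3 is new in the current frame,
# and token 5 sits on a checkerboard (high-frequency) patch.
prev_tokens = np.eye(8, 16)
cur_tokens = prev_tokens.copy()
cur_tokens[3] = np.eye(16)[8]
patches = np.ones((8, 8, 8))
rows, cols = np.indices((8, 8))
patches[5] = (-1.0) ** (rows + cols)

scores = forensic_scores(prev_tokens, cur_tokens, patches)
kept = keep_top_tokens(scores, keep_ratio=0.25)
print(kept)
```

In this toy setup the token with no counterpart in the previous frame draws most of its mass from the dummy node, and the checkerboard patch has nearly all of its energy above the cutoff, so both outrank the unchanged, smooth tokens. The actual scoring and solver used by ForensicZip may differ.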