AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual (AV) forgery detection benchmarks are limited to DeepFake-style manipulations and coarse-grained annotations, failing to reflect the diversity and complexity of real-world scenarios. To address this, we introduce AVFakeBench—the first multimodal AV forgery detection benchmark covering both human and non-human subjects. It comprises 12K high-quality samples, seven distinct forgery categories, and four-level fine-grained annotations, enabling comprehensive evaluation across binary classification, forgery-type identification, spatial-temporal localization, and logical reasoning. We propose a novel multi-stage hybrid forgery generation framework integrating task planning and expert models, and establish a hierarchical evaluation protocol tailored for audio-visual large language models (AV-LMMs). We evaluate 11 AV-LMMs and two classes of detection methods, probing their multimodal understanding, fine-grained perception, causal reasoning, and semantic consistency modeling. Results reveal their strong potential yet critical weaknesses in fine-grained perception and logical inference, establishing AVFakeBench as a rigorous benchmark for future research.

📝 Abstract
The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human and general subjects. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery-type classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and two prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.
Problem

Research questions and friction points this paper is trying to address.

Addressing limited forgery diversity and annotation granularity in existing benchmarks
Developing comprehensive audio-video detection covering human and general subjects
Evaluating AV-LMMs' capabilities and weaknesses in fine-grained forgery analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage hybrid forgery framework for diverse manipulations
Integration of proprietary models with expert generative models
Multi-task evaluation covering binary judgment to explanatory reasoning
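The four-task evaluation described above amounts to grouping benchmark questions by task and scoring each group separately. A minimal sketch of such per-task scoring is shown below; the record fields, task names, and exact-match metric are illustrative assumptions, not the paper's actual protocol.

```python
from collections import defaultdict

# Hypothetical records: each benchmark question carries a task tag,
# the model's answer, and the ground-truth answer.
SAMPLES = [
    {"task": "binary_judgment", "pred": "fake", "gold": "fake"},
    {"task": "binary_judgment", "pred": "real", "gold": "fake"},
    {"task": "forgery_type", "pred": "audio_swap", "gold": "audio_swap"},
    {"task": "detail_selection", "pred": "B", "gold": "C"},
    {"task": "explanatory_reasoning", "pred": "lip mismatch", "gold": "lip mismatch"},
]

def per_task_accuracy(samples):
    """Group answers by task and compute exact-match accuracy per task."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["task"]] += 1
        hits[s["task"]] += int(s["pred"] == s["gold"])
    return {task: hits[task] / totals[task] for task in totals}

print(per_task_accuracy(SAMPLES))
```

Reporting each task separately, rather than one pooled score, is what exposes the gap the paper highlights: models can do well on binary judgment while failing the fine-grained detail and reasoning tasks.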
Shuhan Xia — Beijing University of Posts and Telecommunications (Artificial Intelligence, Multimodal)
Peipei Li — Beijing University of Posts and Telecommunications (Computer Vision, Image Synthesis, Face Recognition)
Xuannan Liu — Beijing University of Posts and Telecommunications
Dongsen Zhang — Beijing University of Posts and Telecommunications
Xinyu Guo — Samsung Research America (AI, Computer Vision, Machine Learning, Medical Image Analysis)
Zekun Li — University of California, Santa Barbara