Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach

📅 2025-11-24
📈 Citations: 0 (influential: 0)
🤖 AI Summary
To address cross-modal security threats posed by audio-visual deepfakes in the AIGC era, this paper proposes a multimodal detection framework based on variational Bayesian correlation modeling. Methodologically, it models the audio-visual cross-modal correlation as a Gaussian latent variable and enforces an orthogonality constraint to disentangle modality-specific features from shared correlation features, thereby jointly capturing local tampering artifacts and global inconsistencies. The framework combines pretrained backbone networks, difference convolutions, and high-pass filtering to extract forgery-aware representations. Extensive experiments demonstrate state-of-the-art performance across multiple benchmark datasets and strong generalization, remaining robust under out-of-distribution settings and low-quality forgery conditions.
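The Gaussian latent-variable modeling described above follows the standard variational Bayes recipe: an encoder predicts a mean and log-variance for the correlation latent, sampling uses the reparameterization trick, and a KL term regularizes the posterior toward a prior. A minimal NumPy sketch of that recipe, with illustrative function names not taken from the paper:

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    # Sample z ~ N(mu, sigma^2) differentiably: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Toy posterior parameters an encoder might predict for the correlation latent.
mu = np.zeros(8)
logvar = np.zeros(8)
z = reparameterize(mu, logvar, np.random.default_rng(0))
print(z.shape)                            # (8,)
print(kl_to_standard_normal(mu, logvar))  # 0.0: posterior equals the prior
```

In a full detector these two pieces would sit inside the training loss; the KL term keeps the estimated correlation distribution well-behaved while the sampled `z` feeds the classifier.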

📝 Abstract
The widespread application of AIGC content has brought not only unprecedented opportunities but also potential security concerns, e.g., audio-visual deepfakes. It is therefore important to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues for deepfake detection. In this paper, we reformulate correlation learning as variational Bayesian estimation, where the audio-visual correlation is approximated as a Gaussian-distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces in both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. We then factorize the variable into modality-specific and correlation-specific components under an orthogonality constraint, allowing them to capture intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that FoVB outperforms other state-of-the-art methods on various benchmarks.
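The orthogonality constraint in the abstract can be realized as a penalty on the inner products between modality-specific and correlation-specific feature vectors, driving the two factors apart. The NumPy sketch below uses a squared cross-Gram penalty; this is one common formulation, assumed here rather than the paper's exact loss:

```python
import numpy as np

def orthogonality_penalty(F_spec, F_corr):
    # Encourage modality-specific rows of F_spec and correlation-specific
    # rows of F_corr to be mutually orthogonal by penalizing the squared
    # Frobenius norm of their cross-Gram matrix.
    G = F_spec @ F_corr.T  # (n_spec, n_corr) pairwise inner products
    return np.sum(G**2)

a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])
print(orthogonality_penalty(a, b))  # 0.0: already orthogonal, no penalty
print(orthogonality_penalty(a, a))  # 1.0: fully aligned, maximal penalty
```

Added to the detection loss with a weighting coefficient, this term pushes the factorized latents toward carrying non-overlapping (intra-modal vs. cross-modal) forgery evidence.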
Problem

Research questions and friction points this paper is trying to address.

Detecting audio-visual deepfakes through cross-modal inconsistency analysis
Developing generalizable multi-modal deepfake detection using variational Bayesian methods
Learning disentangled forgery traces from audio and visual modalities with orthogonality constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Bayesian estimation for audio-visual correlation
Difference convolutions and high-pass filter for forgery traces
Orthogonality constraint to separate modality and correlation factors
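As a concrete illustration of the difference-convolution idea in the bullets above, the sketch below implements a central difference convolution in NumPy: a vanilla convolution minus a theta-weighted central term, which emphasizes local gradients such as tampering artifacts over smooth content. The kernel size, theta value, and this particular variant are assumptions for illustration, not necessarily what FoVB uses:

```python
import numpy as np

def central_difference_conv2d(x, w, theta=0.7):
    # y(p0) = sum_k w(pk) * x(p0 + pk)  -  theta * x(p0) * sum_k w(pk)
    # i.e. a vanilla convolution blended with a central-difference term
    # that responds to local intensity changes rather than absolute values.
    H, W = x.shape
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)  # zero padding keeps the output the same size
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]
            vanilla = np.sum(patch * w)
            out[i, j] = vanilla - theta * x[i, j] * np.sum(w)
    return out

# A constant image has no local differences, so with theta=1 the interior
# response (away from the zero-padded border) vanishes.
x = np.ones((4, 4))
w = np.ones((3, 3)) / 9.0
resp = central_difference_conv2d(x, w, theta=1.0)
print(np.allclose(resp[1:-1, 1:-1], 0.0))  # True
```

A fixed high-pass kernel (e.g. a Laplacian) plays a complementary role: it suppresses low-frequency content so that high-frequency forgery residues stand out before the learned convolutions see the input.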
Fan Nie
Stanford University
Generative AI, Autonomous Driving, Large Language Models

Jiangqun Ni
Professor, Sun Yat-Sen University
Multimedia security

Jian Zhang
School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China

Bin Zhang
Department of New Networks, Pengcheng Laboratory, Shenzhen 518066, China

Weizhe Zhang
Professor, Peng Cheng Laboratory & Harbin Institute of Technology
Parallel and Distributed Systems, Cloud Computing, Realtime Scheduling, Computer Networks

Bin Li
College of Information Engineering, Guangdong Key Laboratory of Intelligent Information Processing, and Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China