🤖 AI Summary
To address the cross-modal security threats posed by audio-visual deepfakes in the AIGC era, this paper proposes a multimodal detection framework based on variational Bayesian correlation modeling. Methodologically, it models the audio-visual cross-modal correlation as a Gaussian latent variable and enforces an orthogonality constraint to disentangle modality-specific features from shared correlation features, thereby jointly capturing local tampering artifacts and global inconsistencies. The framework combines pretrained backbone networks, difference convolutions, and high-pass filtering to extract forgery-aware representations. Extensive experiments demonstrate that the approach achieves state-of-the-art performance across multiple benchmark datasets, with strong generalization capability, remaining robust in particular under out-of-distribution settings and low-quality forgery conditions.
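The forgery-aware feature extraction mentioned above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the 3x3 Laplacian high-pass kernel, the averaging kernel, and the value of θ are illustrative assumptions, and the paper's central difference convolution here follows the common formulation y(p0) = Σ w(pn)·x(p0+pn) − θ·x(p0)·Σ w(pn), which suppresses smooth regions and accentuates local tampering artifacts.

```python
import numpy as np

def conv2d(img, k):
    """Naive valid-mode 2D convolution (cross-correlation) for illustration."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def high_pass_filter(img):
    """Laplacian-style high-pass kernel; zeroes out constant (low-frequency) regions."""
    k = np.array([[0., -1., 0.],
                  [-1., 4., -1.],
                  [0., -1., 0.]])
    return conv2d(img, k)

def central_difference_conv(img, k, theta=0.7):
    """Central difference convolution (assumes a 3x3 kernel):
    vanilla convolution minus theta * (kernel weight sum) * center pixel."""
    vanilla = conv2d(img, k)
    center = img[1:-1, 1:-1]
    return vanilla - theta * k.sum() * center

# On a constant image, both responses vanish: no high-frequency forgery trace.
img = np.full((5, 5), 3.0)
k = np.ones((3, 3)) / 9.0  # illustrative averaging kernel
hp = high_pass_filter(img)                          # all zeros
cdc = central_difference_conv(img, k, theta=1.0)    # all zeros at theta=1
```

With θ between 0 and 1 the operator interpolates between a vanilla convolution (θ=0) and a purely gradient-based response (θ=1), which is why constant regions are fully suppressed only at θ=1 in the demo above.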
📝 Abstract
The widespread application of AIGC content has brought not only unprecedented opportunities but also potential security concerns, e.g., audio-visual deepfakes. It is therefore of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, audio-visual correlation learning can expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues for deepfake detection. In this paper, we reformulate correlation learning as variational Bayesian estimation, where the audio-visual correlation is approximated as a Gaussian-distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, building on the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces in both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of the audio-visual correlation via variational Bayes. We then factorize the variable into modality-specific and correlation-specific ones under an orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods on various benchmarks.
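The variational Bayesian step can be sketched as follows. This is a minimal illustration, not the paper's architecture: the linear encoders, latent dimensions, and the squared-dot-product orthogonality penalty are assumptions standing in for FoVB's actual modules. It shows the three ingredients the abstract names: an amortized Gaussian posterior over the correlation latent, reparameterized sampling with a KL regularizer, and an orthogonality penalty that pushes modality-specific and correlation-specific factors apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_gaussian(x, W_mu, W_logvar):
    """Amortized Gaussian posterior q(z|x) = N(mu, diag(exp(logvar)))."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps so gradients can flow through mu, logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)) per sample; non-negative by construction."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def orthogonality_penalty(z_specific, z_shared):
    """Penalize overlap between modality-specific and correlation-specific
    factors: mean squared inner product, zero when the factors are orthogonal."""
    dots = np.sum(z_specific * z_shared, axis=-1)
    return float(np.mean(dots**2))

# Toy fused audio-visual features (batch of 4, dim 8); random linear encoders.
x = rng.standard_normal((4, 8))
W_mu, W_logvar = rng.standard_normal((8, 6)), rng.standard_normal((8, 6)) * 0.1
mu, logvar = encode_gaussian(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)

# Factorize the latent into two halves and measure their entanglement.
z_specific, z_shared = z[:, :3], z[:, 3:]
penalty = orthogonality_penalty(z_specific, z_shared)
```

In training, the KL term and the orthogonality penalty would be added to the detection loss; here the split into halves merely stands in for whatever factorization FoVB uses.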