Benchmarking Cross-Domain Audio-Visual Deception Detection

📅 2024-05-11
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Current audio-visual deception detection methods suffer from poor cross-scenario generalization, and the field lacks a standardized cross-domain evaluation benchmark. To address this, we introduce the first unified, standardized cross-domain benchmark for audio-visual deception detection, supporting both single-source-to-single-target and multi-source-to-single-target domain generalization settings. We propose MM-IDGM, a gradient-coordinated optimization algorithm, and Attention-Mixer, a novel multimodal fusion architecture. Additionally, we design three multi-source domain sampling strategies and integrate OpenSMILE/ResNet-50 feature extractors with CNN/RNN/Transformer backbones. Extensive experiments demonstrate that our approach achieves an average accuracy improvement of 5.2% under the multi-source-to-single-target setting, significantly enhancing cross-domain generalization. The benchmark and methodology provide a reproducible, comparable, and realistic evaluation framework for practical deployment.

📝 Abstract
Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize to real-world scenarios. We used widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of training on data from multiple source domains, we investigate three domain sampling strategies, namely domain-simultaneous, domain-alternating, and domain-by-domain sampling, for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.
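The abstract describes MM-IDGM only at a high level: maximize the gradient inner product between modality encoders so their updates point in compatible directions. As a rough, hypothetical illustration (not the paper's implementation), the sketch below computes that alignment term for a shared classifier receiving gradients from an audio-only and a visual-only logistic loss; the toy model, dimensions, and names are all assumptions.

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Gradient of the mean logistic loss w.r.t. weights w, labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
    return X.T @ (p - y) / len(y)      # d(loss)/dw

rng = np.random.default_rng(0)
d = 8                                  # shared feature dimension (toy)
w = rng.normal(size=d)                 # shared classifier weights
X_audio = rng.normal(size=(32, d))     # projected audio features (toy stand-in)
X_video = rng.normal(size=(32, d))     # projected visual features (toy stand-in)
y = rng.integers(0, 2, size=32).astype(float)

g_a = logistic_loss_grad(w, X_audio, y)  # gradient from the audio branch
g_v = logistic_loss_grad(w, X_video, y)  # gradient from the visual branch

# MM-IDGM's stated objective: encourage the two modality gradients to agree,
# i.e. maximize their inner product <g_a, g_v> during training.
alignment = float(g_a @ g_v)
print(f"gradient inner product: {alignment:+.4f}")
```

A positive inner product means an update that helps one modality's loss also helps the other; how the paper folds this term into the optimizer is not described in this summary.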
Problem

Research questions and friction points this paper is trying to address.

Assessing generalizability of audio-visual deception detection across domains
Exploring domain sampling strategies for multi-source training data
Improving cross-domain performance via modality encoder optimization
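The three multi-source sampling strategies are only named in the abstract (domain-simultaneous, domain-alternating, domain-by-domain). One plausible reading, sketched as a hypothetical batch scheduler over toy source-domain pools of equal size (batch size 1 for the latter two, purely for illustration):

```python
# Toy source-domain pools; real entries would be audio-visual clips.
domains = {"court": ["c1", "c2"], "lab": ["l1", "l2"], "tv": ["t1", "t2"]}

def domain_simultaneous(domains):
    """Every batch mixes one sample from each source domain."""
    return [list(batch) for batch in zip(*domains.values())]

def domain_alternating(domains):
    """Batches cycle through the domains: court, lab, tv, court, ..."""
    out = []
    for round_items in zip(*domains.values()):
        out.extend([x] for x in round_items)
    return out

def domain_by_domain(domains):
    """Exhaust one source domain entirely before moving to the next."""
    return [[x] for pool in domains.values() for x in pool]

print(domain_simultaneous(domains))  # [['c1', 'l1', 't1'], ['c2', 'l2', 't2']]
print(domain_alternating(domains))   # [['c1'], ['l1'], ['t1'], ['c2'], ['l2'], ['t2']]
print(domain_by_domain(domains))     # [['c1'], ['c2'], ['l1'], ['l2'], ['t1'], ['t2']]
```

The strategies differ only in the order the optimizer sees the source domains, which is exactly the variable the benchmark's multi-to-single evaluation isolates.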
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-domain audio-visual deception detection benchmark
MM-IDGM algorithm for generalization enhancement
Attention-Mixer fusion method for improved performance
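The Attention-Mixer architecture is not detailed in this summary; reading the name literally, one generic sketch is self-attention over the joint audio-visual token sequence followed by an MLP mixing step and pooling. Everything below (single head, no learned projections, random weights, dimensions) is an assumption for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d = 16                                   # embedding dimension (toy)
audio = rng.normal(size=(4, d))          # 4 audio token embeddings
video = rng.normal(size=(6, d))          # 6 visual token embeddings
tokens = np.concatenate([audio, video])  # (10, d) joint token sequence

# Self-attention across the joint audio-visual tokens (single head,
# no learned Q/K/V projections, for brevity).
attn = softmax(tokens @ tokens.T / np.sqrt(d))
mixed_tokens = attn @ tokens             # (10, d) attention-mixed tokens

# MLP "mixer" step over channels, then mean-pool to one fused embedding
# that a deception classifier head could consume.
W1 = rng.normal(size=(d, 2 * d))
W2 = rng.normal(size=(2 * d, d))
hidden = np.maximum(mixed_tokens @ W1, 0.0)  # ReLU
fused = (hidden @ W2).mean(axis=0)           # (d,) fused representation
```

The attention step lets each token weigh evidence from both modalities before mixing, which is the general property any fusion module for this task needs; the paper's actual layer layout may differ.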
Xiaobao Guo
Nanyang Technological University
Zitong Yu
U.S. Food and Drug Administration
Nithish Muthuchamy Selvaraj
ROSE Lab, Nanyang Technological University, Singapore
Bingquan Shen
DSO National Laboratories, Singapore
Adams Wai-Kin Kong
SCSE, Nanyang Technological University, Singapore
Alex C. Kot
ROSE Lab, Nanyang Technological University, Singapore