Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of existing audio deepfake detection methods in cross-modal representation and generalization by systematically exploring, for the first time, the potential of multimodal large language models (MLLMs) for this task. The authors formulate the problem as a question-answering binary classification task, integrating audio inputs with diverse textual prompts to guide the model toward cross-modal feature-aware reasoning. Experiments are conducted using Qwen2-Audio-7B-Instruct and SALMONN under both zero-shot and fine-tuned settings, employing a multi-prompt strategy for evaluation. Results show that while zero-shot performance remains limited, the models achieve strong results on in-domain data with minimal supervision, demonstrating the feasibility and effectiveness of the proposed text-prompt-driven cross-modal learning paradigm.

📝 Abstract
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we aim to explore the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to investigate whether MLLMs can learn robust representations across modalities for audio deepfake detection. To this end, we explore text-aware, context-rich, question-answer-based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. The models perform poorly without task-specific training and struggle to generalise to out-of-domain data; however, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
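The question-answer formulation described above can be sketched as follows. This is an illustrative sketch only: `query_mllm` is a hypothetical stand-in for a real MLLM call (e.g. to Qwen2-Audio-7B-Instruct or SALMONN through their own APIs), the prompt wordings are invented for illustration rather than taken from the paper, and the majority vote is one plausible way to aggregate a multi-prompt evaluation.

```python
# Sketch of a multi-prompt, binary question-answer formulation for
# audio deepfake detection with an MLLM. `query_mllm` is hypothetical.

from collections import Counter

# A multi-prompt set: each prompt frames detection as a binary question.
PROMPTS = [
    "Is this audio clip genuine human speech or synthetically generated? Answer 'real' or 'fake'.",
    "Listen to the recording. Does it contain artefacts of voice synthesis? Answer 'real' or 'fake'.",
    "Question: is this a deepfake audio sample? Answer 'real' or 'fake'.",
]

def parse_answer(text: str) -> str:
    """Map a free-form model answer onto the binary label space."""
    return "fake" if "fake" in text.lower() else "real"

def classify(audio_path: str, query_mllm) -> str:
    """Majority vote over the multi-prompt answers for one audio clip."""
    votes = Counter(
        parse_answer(query_mllm(audio_path, prompt)) for prompt in PROMPTS
    )
    return votes.most_common(1)[0][0]
```

In this sketch, `query_mllm(audio_path, prompt)` would return the model's free-form text answer for one (audio, prompt) pair; aggregating over several prompt phrasings reduces sensitivity to any single wording.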
Problem

Research questions and friction points this paper is trying to address.

audio deepfake detection
multimodal large language models
deepfake
zero-shot learning
out-of-domain generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Audio Deepfake Detection
Prompt-based Reasoning
Cross-modal Representation
Zero-shot and Fine-tuned Evaluation
Akanksha Chuchra
Indian Institute of Technology, Ropar, India
Shukesh Reddy
Machine Intelligence Group, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, India
Sudeepta Mishra
Indian Institute of Technology, Ropar, India
Abhijit Das
BITS Pilani Hyderabad, Dept of CS&IS
Computer Vision, Pattern Recognition, Machine Learning
A. Dhall
Monash University, Melbourne, Australia