Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection

📅 2026-01-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of existing audio deepfake detection methods in cross-modal representation and generalization by systematically exploring, for the first time, the potential of multimodal large language models (MLLMs) for this task. The authors formulate the problem as a question-answering binary classification task, integrating audio inputs with diverse textual prompts to guide the model toward cross-modal feature-aware reasoning. Experiments are conducted using Qwen2-Audio-7B-Instruct and SALMONN under both zero-shot and fine-tuned settings, employing a multi-prompt strategy for evaluation. Results show that while zero-shot performance remains limited, the models achieve strong results on in-domain data with minimal supervision, demonstrating the feasibility and effectiveness of the proposed text-prompt-driven cross-modal learning paradigm.

📝 Abstract
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we aim to explore the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to investigate whether MLLMs can learn robust representations across modalities for audio deepfake detection. To this end, we explore text-aware, context-rich, question-answer-based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. The models perform poorly without task-specific training and struggle to generalise to out-of-domain data; however, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
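The question-answer formulation described above can be sketched as follows. This is an illustrative sketch only: `query_mllm` is a hypothetical stand-in for a real MLLM call (e.g. to Qwen2-Audio-7B-Instruct or SALMONN through their own APIs), the prompt wordings are invented for illustration rather than taken from the paper, and the majority vote is one plausible way to aggregate a multi-prompt evaluation.

```python
# Sketch of a multi-prompt, binary question-answer formulation for
# audio deepfake detection with an MLLM. `query_mllm` is hypothetical.

from collections import Counter

# A multi-prompt set: each prompt frames detection as a binary question.
PROMPTS = [
    "Is this audio clip genuine human speech or synthetically generated? Answer 'real' or 'fake'.",
    "Listen to the recording. Does it contain artefacts of voice synthesis? Answer 'real' or 'fake'.",
    "Question: is this a deepfake audio sample? Answer 'real' or 'fake'.",
]

def parse_answer(text: str) -> str:
    """Map a free-form model answer onto the binary label space."""
    return "fake" if "fake" in text.lower() else "real"

def classify(audio_path: str, query_mllm) -> str:
    """Majority vote over the multi-prompt answers for one audio clip."""
    votes = Counter(
        parse_answer(query_mllm(audio_path, prompt)) for prompt in PROMPTS
    )
    return votes.most_common(1)[0][0]
```

In this sketch, `query_mllm(audio_path, prompt)` would return the model's free-form text answer for one (audio, prompt) pair; aggregating over several prompt phrasings reduces sensitivity to any single wording.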
Problem

Research questions and friction points this paper is trying to address.

audio deepfake detection
multimodal large language models
deepfake
zero-shot learning
out-of-domain generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Audio Deepfake Detection
Prompt-based Reasoning
Cross-modal Representation
Zero-shot and Fine-tuned Evaluation
Akanksha Chuchra
Indian Institute of Technology, Ropar, India
Shukesh Reddy
Machine Intelligence Group, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, India
Sudeepta Mishra
Indian Institute of Technology, Ropar, India
Abhijit Das
BITS Pilani Hyderabad, Dept of CS&IS
Computer Vision, Pattern Recognition, Machine Learning
A. Dhall
Monash University, Melbourne, Australia