🤖 AI Summary
This work proposes AV-LMMDetect, a novel approach to audio-visual deepfake detection that leverages the large multimodal language model Qwen 2.5 Omni, the first such application in this domain. Addressing the limited generalization and cross-domain adaptability of existing methods that rely on small, task-specific models, AV-LMMDetect employs a prompt-driven binary classification framework to effectively fuse audio and visual cues. The method adopts a two-stage fine-tuning strategy: first aligning the model with lightweight LoRA adaptation, followed by full-parameter fine-tuning of the audio and visual encoders. Evaluated on the FakeAVCeleb and Mavos-DD benchmarks, the proposed approach achieves competitive or superior performance compared to current state-of-the-art methods, setting a new record on Mavos-DD and demonstrating significantly enhanced cross-dataset generalization.
📄 Abstract
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they perform well on curated benchmarks but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a large multimodal model adapted via supervised fine-tuning (SFT) that casts AVD as a prompted yes/no classification: "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes the audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full-parameter fine-tuning of the audio and visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
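The two-stage recipe can be illustrated with a minimal sketch in plain PyTorch. This is not the paper's code: the toy "encoder", the layer sizes, and the hand-rolled `LoRALinear` module are all illustrative assumptions, showing only the mechanics of stage 1 (base weights frozen, low-rank adapters trainable) versus stage 2 (encoder unfrozen for full-parameter fine-tuning).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA).

    With B initialized to zeros, the adapted layer is identical to the
    base layer at the start of stage-1 alignment.
    """
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen in stage 1
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output + scaled low-rank correction x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Hypothetical stand-in for a fused audio-visual encoder.
encoder = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False  # stage 1: encoder frozen too

# Two logits for the prompted yes/no decision: "real" vs. "fake".
head = LoRALinear(nn.Linear(16, 2))

# Stage 1: only the LoRA factors A and B receive gradients.
stage1_trainable = [n for n, p in head.named_parameters() if p.requires_grad]

# Stage 2: unfreeze the encoder for full-parameter fine-tuning.
for p in encoder.parameters():
    p.requires_grad = True
```

In an actual SFT setup the trainable-parameter filter above is what you would hand to the optimizer at each stage; here it simply shows that stage 1 touches only the adapter factors while the frozen base path is untouched.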