Leveraging large multimodal models for audio-video deepfake detection: a pilot study

πŸ“… 2026-02-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes AV-LMMDetect, a novel approach to audio-visual deepfake detection that leverages the large multimodal language model Qwen 2.5 Omniβ€”the first such application in this domain. Addressing the limited generalization and cross-domain adaptability of existing methods that rely on small, task-specific models, AV-LMMDetect employs a prompt-driven binary classification framework to effectively fuse audio and visual cues. The method adopts a two-stage fine-tuning strategy: first aligning the model with lightweight LoRA adaptation, followed by full-parameter fine-tuning of the audio and visual encoders. Evaluated on the FakeAVCeleb and Mavos-DD benchmarks, the proposed approach achieves competitive or superior performance compared to current state-of-the-art methods, setting a new record on Mavos-DD and demonstrating significantly enhanced cross-dataset generalization.

Technology Category

Application Category

πŸ“ Abstract
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification -"Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.
Problem

Research questions and friction points this paper is trying to address.

audio-visual deepfake detection
multimodal models
generalization
scalability
deepfake detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

large multimodal model
audio-visual deepfake detection
prompt-based classification
LoRA fine-tuning
cross-domain generalization
πŸ”Ž Similar Papers
No similar papers found.