🤖 AI Summary
This work proposes AV-LMMDetect, a novel approach to audio-visual deepfake detection that leverages the large multimodal language model Qwen 2.5 Omni, the first such application in this domain. Addressing the limited generalization and cross-domain adaptability of existing methods that rely on small, task-specific models, AV-LMMDetect employs a prompt-driven binary classification framework to effectively fuse audio and visual cues. The method adopts a two-stage fine-tuning strategy: first aligning the model with lightweight LoRA adaptation, followed by full-parameter fine-tuning of the audio and visual encoders. Evaluated on the FakeAVCeleb and Mavos-DD benchmarks, the proposed approach achieves competitive or superior performance compared to current state-of-the-art methods, setting a new record on Mavos-DD and demonstrating significantly enhanced cross-dataset generalization.
📄 Abstract
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they perform well on curated benchmarks but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a large multimodal model adapted via supervised fine-tuning (SFT) that casts AVD as a prompted yes/no classification: "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes the audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full-parameter fine-tuning of the audio and visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
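The two-stage recipe can be illustrated with a minimal sketch in plain PyTorch. This is not the paper's code: the toy "encoder", the layer sizes, and the hand-rolled `LoRALinear` module are all illustrative assumptions, showing only the mechanics of stage 1 (base weights frozen, low-rank adapters trainable) versus stage 2 (encoder unfrozen for full-parameter fine-tuning).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA).

    With B initialized to zeros, the adapted layer is identical to the
    base layer at the start of stage-1 alignment.
    """
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen in stage 1
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output + scaled low-rank correction x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Hypothetical stand-in for a fused audio-visual encoder.
encoder = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False  # stage 1: encoder frozen too

# Two logits for the prompted yes/no decision: "real" vs. "fake".
head = LoRALinear(nn.Linear(16, 2))

# Stage 1: only the LoRA factors A and B receive gradients.
stage1_trainable = [n for n, p in head.named_parameters() if p.requires_grad]

# Stage 2: unfreeze the encoder for full-parameter fine-tuning.
for p in encoder.parameters():
    p.requires_grad = True
```

In an actual SFT setup the trainable-parameter filter above is what you would hand to the optimizer at each stage; here it simply shows that stage 1 touches only the adapter factors while the frozen base path is untouched.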