Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting the subtle spatiotemporal inconsistencies in lip movements left by high-fidelity lip-syncing remains challenging for deepfake detectors. Method: LIPINC-V2 is a vision temporal Transformer with multihead cross-attention that jointly models short- and long-term lip-motion anomalies; it focuses exclusively on the mouth region, learning fine-grained spatiotemporal features and capturing inconsistencies both across adjacent frames and throughout the video. Contribution/Results: The authors also introduce LipSyncTIMIT, a new benchmark dataset generated with five state-of-the-art lip-syncing models to simulate real-world scenarios. On LipSyncTIMIT and two public benchmark datasets, LIPINC-V2 achieves state-of-the-art detection performance, particularly against high-fidelity lip-sync deepfakes. The code and dataset are publicly released to support robust lip-sync forgery detection.

📝 Abstract
Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are therefore confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model captures both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2.
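Because the telltale artifacts are confined to the mouth, a detector in this family first isolates a lip crop from every frame. The snippet below is a minimal preprocessing sketch, not the authors' released code: it uses dlib's standard 68-point facial landmarks (points 48-67 cover the mouth) to cut out a padded mouth crop; the padding ratio, output size, and landmark-model path are illustrative assumptions.

```python
# Hedged sketch of per-frame mouth cropping (not the LIPINC-V2 codebase).
# Assumes dlib's standard shape_predictor_68_face_landmarks.dat is on disk.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame_bgr, pad=0.25, out_size=96):
    """Return a square mouth crop, or None if no face is detected."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 outline the outer and inner lips.
    pts = np.array([(shape.part(i).x, shape.part(i).y)
                    for i in range(48, 68)], dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    m = int(pad * max(w, h))               # margin so lip motion stays in frame
    x0, y0 = max(x - m, 0), max(y - m, 0)
    crop = frame_bgr[y0:y + h + m, x0:x + w + m]
    return cv2.resize(crop, (out_size, out_size))
```

Stacking such crops over time yields the per-frame inputs that a temporal model can consume.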
Problem

Research questions and friction points this paper is trying to address.

Detect lip-syncing deepfakes via mouth inconsistencies
Identify spatiotemporal artifacts in mouth movements
Improve detection accuracy with a vision temporal transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision temporal transformer for mouth inconsistencies
Multihead cross-attention captures spatiotemporal artifacts (see the sketch after this list)
LipSyncTIMIT dataset for realistic deepfake simulation
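A minimal PyTorch sketch of how these two ideas could fit together; this is not the released LIPINC-V2 implementation. A transformer encoder over per-frame mouth embeddings supplies long-term context, and adjacent-frame feature differences serve as a short-term motion cue that queries that context via multihead cross-attention. All dimensions, depths, and the difference-based short-term branch are illustrative assumptions.

```python
# Hedged sketch of a vision temporal transformer with multihead
# cross-attention for lip-sync detection (illustrative, not the paper's code).
import torch
import torch.nn as nn

class LipTemporalDetector(nn.Module):
    def __init__(self, dim=256, heads=8, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)           # real-vs-fake logit

    def forward(self, frame_emb):               # (B, T, dim) mouth-crop embeddings
        ctx = self.temporal(frame_emb)          # long-term dependencies over T
        # Short-term cue: adjacent-frame differences approximate local
        # lip motion; they query the long-term context.
        local = frame_emb[:, 1:] - frame_emb[:, :-1]   # (B, T-1, dim)
        fused, _ = self.cross_attn(local, ctx, ctx)
        return self.head(fused.mean(dim=1))     # one logit per video

# Usage with hypothetical shapes: embeddings would come from any
# mouth-crop CNN backbone.
x = torch.randn(2, 16, 256)          # 2 clips, 16 frames, 256-dim features
logits = LipTemporalDetector()(x)    # -> (2, 1)
```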
👥 Authors
Soumyya Kanti Datta, University at Buffalo, State University of New York
Shan Jia, Google (Digital Media Forensics, DeepFakes, Biometrics, Computer Vision)
Siwei Lyu, University at Buffalo, State University of New York