Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting the subtle spatiotemporal inconsistencies in lip movements left by high-fidelity lip-syncing remains challenging for deepfake detectors. Method: LIPINC-V2 is a vision temporal Transformer with multihead cross-attention that jointly models short- and long-term lip-motion anomalies; it focuses exclusively on the mouth region, learning fine-grained spatiotemporal features and capturing inconsistencies both across adjacent frames and throughout the video. Contribution/Results: The authors also introduce LipSyncTIMIT, a new benchmark dataset generated with five state-of-the-art lip-syncing models to simulate real-world scenarios. On LipSyncTIMIT and two public benchmark datasets, LIPINC-V2 achieves state-of-the-art detection performance, particularly against high-fidelity lip-sync deepfakes. The code and dataset are publicly released to support robust lip-sync forgery detection.

📝 Abstract
Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are therefore confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model captures both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2.
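Because the telltale artifacts are confined to the mouth, a detector in this family first isolates a lip crop from every frame. The snippet below is a minimal preprocessing sketch, not the authors' released code: it uses dlib's standard 68-point facial landmarks (points 48-67 cover the mouth) to cut out a padded mouth crop; the padding ratio, output size, and landmark-model path are illustrative assumptions.

```python
# Hedged sketch of per-frame mouth cropping (not the LIPINC-V2 codebase).
# Assumes dlib's standard shape_predictor_68_face_landmarks.dat is on disk.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame_bgr, pad=0.25, out_size=96):
    """Return a square mouth crop, or None if no face is detected."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 outline the outer and inner lips.
    pts = np.array([(shape.part(i).x, shape.part(i).y)
                    for i in range(48, 68)], dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    m = int(pad * max(w, h))               # margin so lip motion stays in frame
    x0, y0 = max(x - m, 0), max(y - m, 0)
    crop = frame_bgr[y0:y + h + m, x0:x + w + m]
    return cv2.resize(crop, (out_size, out_size))
```

Stacking such crops over time yields the per-frame inputs that a temporal model can consume.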
Problem

Research questions and friction points this paper is trying to address.

Detect lip-syncing deepfakes via mouth inconsistencies
Identify spatiotemporal artifacts in mouth movements
Improve detection accuracy with a vision temporal transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision temporal transformer for mouth inconsistencies
Multihead cross-attention captures spatiotemporal artifacts (see the sketch after this list)
LipSyncTIMIT dataset for realistic deepfake simulation
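A minimal PyTorch sketch of how these two ideas could fit together; this is not the released LIPINC-V2 implementation. A transformer encoder over per-frame mouth embeddings supplies long-term context, and adjacent-frame feature differences serve as a short-term motion cue that queries that context via multihead cross-attention. All dimensions, depths, and the difference-based short-term branch are illustrative assumptions.

```python
# Hedged sketch of a vision temporal transformer with multihead
# cross-attention for lip-sync detection (illustrative, not the paper's code).
import torch
import torch.nn as nn

class LipTemporalDetector(nn.Module):
    def __init__(self, dim=256, heads=8, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)           # real-vs-fake logit

    def forward(self, frame_emb):               # (B, T, dim) mouth-crop embeddings
        ctx = self.temporal(frame_emb)          # long-term dependencies over T
        # Short-term cue: adjacent-frame differences approximate local
        # lip motion; they query the long-term context.
        local = frame_emb[:, 1:] - frame_emb[:, :-1]   # (B, T-1, dim)
        fused, _ = self.cross_attn(local, ctx, ctx)
        return self.head(fused.mean(dim=1))     # one logit per video

# Usage with hypothetical shapes: embeddings would come from any
# mouth-crop CNN backbone.
x = torch.randn(2, 16, 256)          # 2 clips, 16 frames, 256-dim features
logits = LipTemporalDetector()(x)    # -> (2, 1)
```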
👥 Authors
Soumyya Kanti Datta, University at Buffalo, State University of New York
Shan Jia, Google (Digital Media Forensics, DeepFakes, Biometrics, Computer Vision)
Siwei Lyu, University at Buffalo, State University of New York