🤖 AI Summary
Detecting subtle spatiotemporal inconsistencies in lip movements remains challenging for deepfake videos with high-fidelity lip-sync. Method: We propose LIPINC-V2, a vision temporal transformer with multi-head cross-attention that jointly models short- and long-term lip-motion anomalies; it restricts analysis to the lip region, learning fine-grained spatiotemporal features and perceiving dynamic anomalies across frames. Contribution/Results: We introduce LipSyncTIMIT, the first benchmark dataset covering five state-of-the-art lip-sync models, to enable systematic evaluation. On LipSyncTIMIT and two public benchmarks, LIPINC-V2 achieves state-of-the-art detection accuracy, particularly against high-fidelity lip-sync deepfakes. This work establishes a new paradigm for lip-sync forgery detection and provides diverse data to support the development of robust authentication methods.
📝 Abstract
Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are among the most challenging to detect. In these videos, a person's lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multi-head cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model captures both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2.
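To make the cross-attention idea concrete, the sketch below shows how one stream of lip-region features (e.g. short-term, adjacent-frame features) can attend to another (e.g. long-term, video-level features) via multi-head scaled dot-product cross-attention. This is a minimal NumPy illustration of the generic mechanism, not the authors' LIPINC-V2 implementation; the function name, feature dimensions, and random projection matrices (stand-ins for learned weights) are all assumptions.

```python
import numpy as np

def multi_head_cross_attention(queries, keys_values, num_heads=4, seed=0):
    """Generic multi-head cross-attention: `queries` (e.g. short-term lip
    features) attend to `keys_values` (e.g. long-term lip features).
    Projection matrices are random stand-ins for learned parameters."""
    rng = np.random.default_rng(seed)
    t_q, d = queries.shape
    t_kv, _ = keys_values.shape
    assert d % num_heads == 0
    d_h = d // num_heads
    w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d)
                          for _ in range(4))
    # Project and split into heads: (num_heads, time, d_h)
    q = (queries @ w_q).reshape(t_q, num_heads, d_h).transpose(1, 0, 2)
    k = (keys_values @ w_k).reshape(t_kv, num_heads, d_h).transpose(1, 0, 2)
    v = (keys_values @ w_v).reshape(t_kv, num_heads, d_h).transpose(1, 0, 2)
    # Scaled dot-product attention, softmax over the key frames
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)   # (heads, t_q, t_kv)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    # Merge heads back and apply output projection
    out = (weights @ v).transpose(1, 0, 2).reshape(t_q, d)
    return out @ w_o

# Hypothetical shapes: 5 adjacent-frame lip features query 30 video-level ones.
short_term = np.random.default_rng(1).standard_normal((5, 64))
long_term = np.random.default_rng(2).standard_normal((30, 64))
fused = multi_head_cross_attention(short_term, long_term, num_heads=4)
print(fused.shape)  # (5, 64)
```

Because the query and key/value sequences may have different lengths, cross-attention lets a few adjacent-frame features be contextualized against the whole video, which is how short- and long-term inconsistency cues can be fused in a single operation.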