🤖 AI Summary
To address the growing security risks posed by AI-generated human motion videos, this paper proposes a multimodal semantic embedding-based detection method that overcomes the limitations of conventional approaches relying on low-level visual cues (e.g., optical flow, texture). The method introduces a novel cross-modal discriminative paradigm operating at the level of human motion semantics, integrating contrastive learning with temporal action modeling to achieve separability between real and synthetic videos in a semantic embedding space, and it exhibits strong robustness against post-processing "laundering" attacks. The authors construct a dedicated benchmark dataset covering seven mainstream text-to-video diffusion models and evaluate the method on it. Experimental results demonstrate significant gains over existing state-of-the-art methods, with high accuracy and superior cross-model generalization.
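The core discriminative idea, that real and AI-generated clips become linearly separable once mapped into a semantic embedding space, can be sketched minimally. The sketch below is an illustration only, not the paper's implementation: the embeddings are synthetic stand-ins drawn from two shifted Gaussians (in the actual pipeline they would come from a pretrained multimodal video-text encoder), and the nearest-centroid rule is a hypothetical stand-in for the learned classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # embedding dimensionality (illustrative choice)

def embed(center, n):
    """Stand-in for L2-normalized semantic video embeddings."""
    x = rng.normal(loc=center, scale=0.5, size=(n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Two cluster centers standing in for "real" and "AI-generated" clips.
real_center = rng.normal(size=dim)
fake_center = real_center + 0.8 * rng.normal(size=dim)

train_real, train_fake = embed(real_center, 200), embed(fake_center, 200)
test_real, test_fake = embed(real_center, 50), embed(fake_center, 50)

# Nearest-centroid classifier by cosine similarity: unit-normalize the
# per-class mean embeddings, then compare dot products.
c_real = train_real.mean(axis=0)
c_fake = train_fake.mean(axis=0)
c_real /= np.linalg.norm(c_real)
c_fake /= np.linalg.norm(c_fake)

def predict(x):
    # Label 1 ("AI-generated") when a clip is closer to the fake centroid.
    return (x @ c_fake > x @ c_real).astype(int)

acc = np.mean(np.concatenate([predict(test_real) == 0,
                              predict(test_fake) == 1]))
print(f"accuracy: {acc:.2f}")
```

Because the decision is made on semantic embeddings rather than pixel statistics, post-processing operations such as recompression or resizing, which perturb low-level cues, leave the classification geometry largely intact; this is the intuition behind the method's robustness to laundering.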
📝 Abstract
Full-blown AI-generated video generation continues its journey through the uncanny valley to produce content that is perceptually indistinguishable from reality. Intermixed with many exciting and creative applications are malicious applications that harm individuals, organizations, and democracies. We describe an effective and robust technique for distinguishing real from AI-generated human motion. This technique leverages a multi-modal semantic embedding, making it robust to the types of laundering that typically confound more low- to mid-level approaches. This method is evaluated against a custom-built dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage.