HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly

📅 2025-07-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing deepfake detection methods lack fine-grained discrimination among human portrait video forgery types, resulting in insufficient interpretability and reliability. Method: This paper proposes HumanSAM, a novel framework that categorizes human forgeries into three orthogonal anomaly dimensions: spatial, appearance, and motion. We introduce HFV, the first public benchmark explicitly designed for fine-grained human forgery type classification. HumanSAM employs a dual-branch architecture integrating spatiotemporal video understanding with geometrically grounded spatial depth features; it further incorporates geometric, semantic, and spatiotemporal consistency modeling, augmented by three domain-specific prior scores and a rank-based confidence enhancement strategy. Contribution/Results: Extensive experiments demonstrate that HumanSAM achieves state-of-the-art performance on both binary detection and fine-grained multi-class forgery type classification, significantly outperforming existing methods. Ablation studies confirm the effectiveness and robustness of each component, particularly under challenging generative video conditions.

πŸ“ Abstract
Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomaly. To better capture the features of geometry, semantics and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representation by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
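The dual-branch design described in the abstract (a video-understanding branch fused with a spatial-depth branch) can be pictured with a minimal sketch. The feature shapes, the `fuse_branches` name, and the toy linear projection below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of HumanSAM's dual-branch fusion: shapes and the
# projection are toy assumptions, not the paper's actual architecture.

rng = np.random.default_rng(0)

def fuse_branches(video_feat: np.ndarray, depth_feat: np.ndarray) -> np.ndarray:
    """Concatenate per-frame video-understanding and spatial-depth features,
    then project them into a joint human forgery representation."""
    joint = np.concatenate([video_feat, depth_feat], axis=-1)  # (T, Dv + Dd)
    w = rng.standard_normal((joint.shape[-1], 256)) * 0.01     # toy projection
    return np.tanh(joint @ w)                                  # (T, 256)

video_feat = rng.standard_normal((16, 768))  # semantic features, 16 frames
depth_feat = rng.standard_normal((16, 128))  # depth features, 16 frames
rep = fuse_branches(video_feat, depth_feat)
print(rep.shape)  # (16, 256)
```

The fused representation would then feed the three-way anomaly classifier (spatial, appearance, motion).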
Problem

Research questions and friction points this paper is trying to address.

Classify human-centric forgery videos into spatial, appearance, and motion anomalies
Improve reliability and interpretability of forgery video detection
Address lack of fine-grained understanding of forgery types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses video understanding and spatial depth
Uses rank-based confidence enhancement strategy
Introduces Human-centric Forgery Video dataset
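The rank-based confidence enhancement listed above can be sketched as a prior-weighted pairwise ranking hinge over the three anomaly classes. The prior values, margin, and exact loss form are assumptions for illustration; the paper's formulation may differ.

```python
import numpy as np

# Hedged sketch of rank-based confidence enhancement: a pairwise ranking
# hinge weighted by a domain prior score. Values are illustrative only.

def rank_confidence_loss(logits, label, priors, margin=0.2):
    """Encourage the correct anomaly class (spatial/appearance/motion)
    to rank above the other two by at least `margin`, weighted by the
    prior score of the labeled class."""
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax confidence
    correct = probs[label]
    loss = 0.0
    for k, p in enumerate(probs):
        if k != label:
            loss += max(0.0, margin + p - correct)  # pairwise ranking hinge
    return priors[label] * loss                     # prior-weighted penalty

logits = np.array([0.5, 0.4, 0.1])  # spatial, appearance, motion scores
priors = np.array([0.9, 0.6, 0.7])  # hypothetical domain prior scores
print(rank_confidence_loss(logits, 0, priors))  # positive: ranking too close
```

A confident prediction whose correct class already dominates by the margin incurs zero loss, so the penalty only sharpens ambiguous rankings.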
Authors

Chang Liu, National University of Defense Technology
Yunfan Ye, National University of Defense Technology (Low-level Vision, Computer Graphics, Edge Detection)
Fan Zhang, National University of Defense Technology
Qingyang Zhou, National University of Defense Technology
Yuchuan Luo, National University of Defense Technology
Zhiping Cai, National University of Defense Technology