SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in video-based human pose estimation, including label sparsity, weak long-range temporal modeling, and the overlooked complementarity between temporal pose heatmaps and visual features, this paper proposes STDPose. The framework introduces (1) a Dynamic-Aware Mask that captures long-range motion context and (2) a module that encodes and aggregates spatiotemporal representations and motion dynamics to model spatiotemporal relationships. Pseudo-labels generated by pose propagation are additionally used for training. STDPose reports state-of-the-art results on both video pose propagation and pose estimation across three large-scale datasets, and achieves competitive performance with only 26.7% of the data labeled.
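
For readers unfamiliar with the heatmap representation mentioned in the summary, the following is a minimal sketch of how a single joint heatmap is commonly turned into a keypoint location by taking its highest-scoring cell. It is a generic illustration, not STDPose's decoding step; the name `decode_keypoint` and the nested-list heatmap layout are assumptions made for this example.

```python
from typing import List, Tuple

# Assumed layout: rows index y, columns index x, values are confidence scores.
Heatmap = List[List[float]]


def decode_keypoint(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, score) of the highest-scoring cell in one joint heatmap.

    Generic argmax decoding; hypothetical helper, not taken from the paper.
    """
    if not heatmap or not heatmap[0]:
        raise ValueError("heatmap must be non-empty")
    best_x, best_y, best_score = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            # Strict comparison keeps the first maximum on ties.
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


example = [
    [0.0, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.0, 0.0, 0.3],
]
x, y, score = decode_keypoint(example)
```

Here the peak sits in the second row, third column, so the decoded joint is at x = 2, y = 1.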

📝 Abstract
Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.
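
The pseudo-label idea in the abstract can be sketched roughly as follows: frames with manual annotations keep them, and every other frame receives a pose produced by a propagation function. This is only an illustration under stated assumptions. The names `build_training_labels` and `nearest_labeled_copy` are hypothetical, and the nearest-frame copy is a trivial stand-in for the learned propagation that STDPose performs.

```python
from typing import Callable, Dict, List, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) pair per joint (assumed format)


def nearest_labeled_copy(frame_idx: int, labeled: Dict[int, Pose]) -> Pose:
    """Placeholder propagator: copy the pose of the closest labeled frame.

    NOT the paper's method; it only stands in for a learned propagator.
    Ties are broken in favor of the earlier frame.
    """
    nearest = min(labeled, key=lambda i: (abs(i - frame_idx), i))
    return list(labeled[nearest])


def build_training_labels(
    num_frames: int,
    labeled: Dict[int, Pose],
    propagate: Callable[[int, Dict[int, Pose]], Pose] = nearest_labeled_copy,
) -> Tuple[Dict[int, Pose], float]:
    """Return a label for every frame and the manually labeled fraction."""
    if num_frames <= 0 or not labeled:
        raise ValueError("need at least one frame and one labeled frame")
    labels: Dict[int, Pose] = {}
    for t in range(num_frames):
        if t in labeled:
            labels[t] = labeled[t]  # ground-truth annotation
        else:
            labels[t] = propagate(t, labeled)  # pseudo-label
    return labels, len(labeled) / num_frames


# Toy video: 6 frames, single-joint poses, frames 0 and 4 annotated.
annotations = {0: [(10.0, 20.0)], 4: [(30.0, 40.0)]}
labels, labeled_fraction = build_training_labels(6, annotations)
```

In this toy run, frames 1 and 2 inherit the pose from frame 0 and frames 3 and 5 inherit it from frame 4, with one third of the frames manually labeled. A learned propagator would be passed in through the `propagate` argument instead of the placeholder.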

Problem

Research questions and friction points this paper is trying to address.

Human Pose Recognition
Limited Labeled Data
Video Understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

STDPose
pose estimation
limited annotations
Yingying Jiao
College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
Zhigang Wang
College of Computer Science and Technology, Zhejiang Gongshang University
Sifan Wu
College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
Shaojing Fan
Department of Electrical and Computer Engineering, National University of Singapore
Cognitive Vision, Computer Vision, Experimental Psychology
Zhenguang Liu
Zhejiang University
Blockchain, Smart Contract Security, Multimedia
Zhuoyue Xu
College of Computer Science and Technology, Zhejiang Gongshang University
Zheqi Wu
College of Computer Science and Technology, Zhejiang Gongshang University