SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

📅 2025-01-25

🤖 AI Summary

This paper proposes STDPose, a framework for video-based human pose estimation that targets three problems: the cost of dense manual annotation, weak modeling of long-range temporal dependencies, and the overlooked complementarity between temporal pose heatmaps and visual features. STDPose introduces (1) a Dynamic-Aware Mask that captures long-range motion context and (2) a module that encodes and aggregates spatiotemporal representations and motion dynamics to model spatiotemporal relationships. It reports state-of-the-art results on three large-scale datasets for both video pose propagation (propagating annotations from labeled frames to unlabeled frames) and pose estimation. When trained with pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% of the data labeled.

📝 Abstract

Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.
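The pseudo-labeling idea in the abstract can be illustrated with a minimal sketch. This is not the STDPose implementation: `copy_propagate` is a hypothetical toy stand-in for a learned propagation model, `build_labels` and `min_conf` are names invented for illustration, and the frame counts are chosen only so that the labeled ratio works out to 26.7% (4 of 15 frames). How the paper actually samples labeled frames is not described here.

```python
from typing import Callable, Dict, List, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) coordinate per joint
Propagator = Callable[[int, Pose, int], Tuple[Pose, float]]


def copy_propagate(src: int, pose: Pose, dst: int) -> Tuple[Pose, float]:
    """Hypothetical stand-in for a learned pose propagation model.

    Copies the source pose unchanged and lowers the confidence as the
    temporal gap between the source and target frames grows.
    """
    return list(pose), 1.0 / (1 + abs(dst - src))


def build_labels(
    num_frames: int,
    manual: Dict[int, Pose],
    propagate: Propagator,
    min_conf: float,
) -> Dict[int, Tuple[Pose, str]]:
    """Combine manual labels with confidence-filtered pseudo-labels."""
    labeled = sorted(manual)
    labels: Dict[int, Tuple[Pose, str]] = {}
    for t in range(num_frames):
        if t in manual:
            labels[t] = (manual[t], "manual")
            continue
        # Propagate from the nearest labeled frame (earlier frame wins ties).
        src = min(labeled, key=lambda i: (abs(i - t), i))
        pose, conf = propagate(src, manual[src], t)
        if conf >= min_conf:  # keep only confident pseudo-labels
            labels[t] = (pose, "pseudo")
    return labels


# Assumed setup: 15 frames, every 4th frame manually annotated.
manual = {t: [(float(t), float(t))] for t in (0, 4, 8, 12)}
labels = build_labels(15, manual, copy_propagate, min_conf=0.5)
labeled_ratio = len(manual) / 15
print(f"manually labeled: {labeled_ratio:.1%}, usable frames: {len(labels)}")
```

In this toy setup, frames two steps away from any annotation fall below the confidence threshold and are left out of training. A real propagation model would presumably produce much stronger pseudo-labels than simple copying, which is the gap STDPose aims to close.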

Problem

Research questions and friction points this paper is trying to address.

Human Pose Recognition

Limited Labeled Data

Video Understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

STDPose

pose estimation

limited annotations
