🤖 AI Summary
This paper proposes STDPose, a framework for video-based human pose estimation that targets three challenges: sparse labels, weak long-range temporal modeling, and the neglected complementarity between temporal pose heatmaps and visual features. STDPose introduces (1) a Dynamic-Aware Mask that captures long-range motion context, and (2) a module that encodes and aggregates spatiotemporal representations and motion dynamics, combining heatmap and visual-feature cues. It also uses pseudo-labels generated by pose propagation to train with sparse annotations. STDPose sets a new state of the art on both video pose propagation and pose estimation across three large-scale benchmarks. With pseudo-labels, it remains competitive while using only 26.7% labeled data, narrowing the gap between sparsely supervised and fully supervised video pose estimation.
📝 Abstract
Human pose estimation in videos remains a challenge, largely because it relies on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely labeled videos. STDPose incorporates two key innovations: 1) a novel Dynamic-Aware Mask that captures long-range motion context, allowing for a nuanced understanding of pose changes; and 2) a system for encoding and aggregating spatiotemporal representations and motion dynamics that effectively models spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation across three large-scale evaluation datasets. Additionally, using pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.
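To make the pseudo-labeling idea concrete, the following is a minimal sketch of how pose propagation could fill in unlabeled frames of a sparsely labeled video. It is not the STDPose implementation: the function names, the nearest-labeled-frame rule, and the `identity_propagate` placeholder are all assumptions made for illustration, and a learned propagation model would take the place of the placeholder.

```python
import numpy as np

def nearest_labeled_frame(t, labeled_idx):
    """Return the labeled frame index closest in time to frame t.

    Ties go to the earlier frame (np.argmin returns the first minimum).
    """
    labeled_idx = np.asarray(labeled_idx)
    return int(labeled_idx[np.argmin(np.abs(labeled_idx - t))])

def build_pseudo_labels(num_frames, labels, propagate):
    """Assemble per-frame poses for a video with sparse annotations.

    labels:    dict mapping frame index -> (K, 2) keypoint array (annotated frames).
    propagate: hypothetical callable (src_pose, src_t, dst_t) -> (K, 2) pose.
    Returns a dict mapping frame index -> (pose, is_pseudo).
    """
    labeled_idx = sorted(labels)
    out = {}
    for t in range(num_frames):
        if t in labels:
            # Annotated frame: keep the ground-truth pose.
            out[t] = (labels[t], False)
        else:
            # Unlabeled frame: propagate from the nearest annotated frame.
            s = nearest_labeled_frame(t, labeled_idx)
            out[t] = (propagate(labels[s], s, t), True)
    return out

def identity_propagate(src_pose, src_t, dst_t):
    # Placeholder only: copies the source pose unchanged.
    # A trained propagation network would be called here instead.
    return src_pose.copy()

# Example: 6 frames, annotations on frames 0 and 4 only.
labels = {0: np.zeros((17, 2)), 4: np.ones((17, 2))}
video_poses = build_pseudo_labels(6, labels, identity_propagate)
```

The `is_pseudo` flag is kept so that a training loop could, for example, weight pseudo-labeled frames differently from annotated ones; whether STDPose does so is not stated in the abstract.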