UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing self-supervised methods for hand pose estimation are prone to interference from noisy pseudo-labels and often neglect fine-grained spatial relationships, leading to unstable training. To address these limitations, this work proposes a novel self-supervised framework that models pose uncertainty by constructing a probabilistic 3D point cloud feature space and employs conditional normalizing flows to generate multiple plausible pose hypotheses, thereby enabling feature interaction across multi-view and temporal dimensions. By incorporating multi-view consistency constraints, the proposed method achieves substantial performance gains on three challenging benchmarks, reducing the MPVPE metric by up to 37.8% compared to current self-supervised approaches, and demonstrates markedly improved robustness and training stability.
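For readers unfamiliar with the metric, MPVPE (Mean Per Vertex Position Error) is commonly computed as the average Euclidean distance between predicted and ground-truth mesh vertices. The sketch below shows that computation and how a relative reduction between two methods would be derived. The function names and the numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
import math


def mpvpe(pred_vertices, gt_vertices):
    """Mean Euclidean distance between corresponding 3D vertices.

    Both arguments are sequences of (x, y, z) tuples in the same unit
    (for example millimetres) and in the same vertex order.
    """
    if len(pred_vertices) != len(gt_vertices) or not pred_vertices:
        raise ValueError("vertex lists must be non-empty and equally long")
    total = sum(math.dist(p, g) for p, g in zip(pred_vertices, gt_vertices))
    return total / len(pred_vertices)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

As a usage example with made-up inputs, `mpvpe([(0, 0, 0), (1, 0, 0)], [(3, 4, 0), (1, 0, 0)])` averages the per-vertex distances 5 and 0, and `relative_reduction(10.0, 7.5)` expresses a drop from 10.0 to 7.5 as a percentage.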

📝 Abstract

Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods use either the discrepancy between input images and rendered outputs or multi-view consistency constraints as the training signal to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and do not fully exploit fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of the hand pose and constructs a probabilistic point cloud feature space, enabling the modeling of complex spatiotemporal relationships. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and sample diverse hypotheses, which makes learning under noisy pseudo-label supervision more robust and stable. These multiple hypotheses are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, allowing the model to explore hand motion patterns and fine-grained spatial correlations comprehensively. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).
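To make the idea of sampling several pose hypotheses from a conditional normalizing flow more concrete, here is a minimal sketch that uses a single conditional affine layer, the simplest possible flow. It is not the UST-Hand architecture, which the abstract does not specify. The `conditioner` function is a hypothetical stand-in for a learned network, and the three-dimensional output is an arbitrary choice for illustration.

```python
import math
import random


def conditioner(cond):
    """Hypothetical stand-in for a learned network.

    Maps a condition vector (e.g. image features) to a per-dimension
    shift and log-scale for a 3-dimensional output.
    """
    mean_c = sum(cond) / len(cond)
    shift = [mean_c + 0.1 * i for i in range(3)]
    log_scale = [-1.0 + 0.05 * i for i in range(3)]
    return shift, log_scale


def sample_hypotheses(cond, k, rng):
    """Draw k samples x = shift + exp(log_scale) * z with z ~ N(0, I)."""
    shift, log_scale = conditioner(cond)
    hypotheses = []
    for _ in range(k):
        z = [rng.gauss(0.0, 1.0) for _ in shift]
        x = [m + math.exp(s) * zi for m, s, zi in zip(shift, log_scale, z)]
        hypotheses.append(x)
    return hypotheses


def log_prob(x, cond):
    """Exact log-density of x via the change-of-variables formula."""
    shift, log_scale = conditioner(cond)
    # Invert the affine map to recover the base-space point z.
    z = [(xi - m) * math.exp(-s) for xi, m, s in zip(x, shift, log_scale)]
    log_base = sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)
    # log|det dz/dx| for an elementwise affine map is -sum(log_scale).
    return log_base - sum(log_scale)
```

Calling `sample_hypotheses([0.2, 0.4], 5, random.Random(0))` returns five different 3D vectors for the same condition, and `log_prob` scores how plausible any one of them is. In a practical model, several such invertible layers would be stacked and the conditioner would be trained, but the sampling and density logic follows the same pattern.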

Problem

Research questions and friction points this paper is trying to address.

self-supervised hand pose estimation

noisy pseudo-labels

spatial correlations

model training stability

3D hand pose

Innovation

Methods, ideas, or system contributions that make the work stand out.

uncertainty-aware

spatiotemporal point cloud

self-supervised hand pose estimation