🤖 AI Summary
To address depth ambiguity and trajectory incoherence in 2D-to-3D human pose lifting from monocular video, this paper proposes HGFreNet, a GraphFormer framework that models global spatial-temporal joint correlations. The method integrates a hop-hybrid graph attention (HGA) module to aggregate features over multi-hop skeletal neighborhoods; constrains 3D trajectory consistency in the frequency domain to capture long-range temporal coherence; and combines a Transformer encoder with a preliminary 3D pose estimation network that supplies 3D information for depth inference across frames. On Human3.6M and MPI-INF-3DHP, the approach is reported to outperform state-of-the-art methods in both positional accuracy (MPJPE) and temporal consistency (ACCEL). These results support the use of frequency-domain modeling and multi-hop graph structures for consistent 3D pose estimation.
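The frequency-domain constraint can be illustrated with a minimal sketch. It assumes a discrete cosine transform (DCT-II) along the time axis and an L1 penalty on the coefficient differences; the summary does not state which transform or norm the paper uses, and the names `dct_ii` and `frequency_consistency_loss` are hypothetical.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II of a 1D sequence (assumed transform, not confirmed by the paper)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (t + 0.5) * k / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def frequency_consistency_loss(pred, target):
    """Mean absolute difference between the DCT coefficients of two trajectories.

    pred, target: lists of frames; each frame is a list of coordinates
    (for example the x, y, z position of a single joint).
    """
    num_frames = len(pred)
    num_coords = len(pred[0])
    total = 0.0
    for c in range(num_coords):
        # Transform each coordinate's time series separately.
        p = dct_ii([frame[c] for frame in pred])
        q = dct_ii([frame[c] for frame in target])
        total += sum(abs(a - b) for a, b in zip(p, q))
    return total / (num_frames * num_coords)


# Toy data: a smooth single-joint trajectory and a copy with frame-to-frame jitter on x.
smooth = [[math.sin(0.2 * t), 0.0, 1.0] for t in range(16)]
jitter = [[x + (0.05 if t % 2 else -0.05), y, z] for t, (x, y, z) in enumerate(smooth)]

loss_same = frequency_consistency_loss(smooth, smooth)
loss_jitter = frequency_consistency_loss(jitter, smooth)
```

Because every DCT coefficient depends on all frames of the sequence, a penalty of this kind compares whole trajectories rather than only adjacent frames, which is the motivation the summary gives for working in the frequency domain.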
📝 Abstract
2D-to-3D human pose lifting is a fundamental challenge for 3D human pose estimation in monocular video, where graph convolutional networks (GCNs) and attention mechanisms have proven to be inherently suitable for encoding the spatial-temporal correlations of skeletal joints. However, depth ambiguity and errors in 2D pose estimation lead to incoherence in the 3D trajectory. Previous studies have attempted to suppress jitter in the time domain, for instance by constraining the differences between adjacent frames, but they neglect the global spatial-temporal correlations of skeletal joint motion. To tackle this problem, we design HGFreNet, a novel GraphFormer architecture with hop-hybrid feature aggregation and 3D trajectory consistency in the frequency domain. Specifically, we propose a hop-hybrid graph attention (HGA) module and a Transformer encoder to model global joint spatial-temporal correlations. The HGA module groups all $k$-hop neighbors of a skeletal joint into a hybrid group to enlarge the receptive field and applies the attention mechanism to discover the latent correlations of these groups globally. We then exploit global temporal correlations by constraining trajectory consistency in the frequency domain. To provide 3D information for depth inference across frames and maintain coherence over time, a preliminary network is applied to estimate the 3D pose. Extensive experiments were conducted on two standard benchmark datasets: Human3.6M and MPI-INF-3DHP. The results demonstrate that the proposed HGFreNet outperforms state-of-the-art (SOTA) methods in terms of positional accuracy and temporal consistency.
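The hop-based grouping that the HGA module relies on can be sketched with a breadth-first search over the skeleton graph. This is an illustration under stated assumptions, not the paper's implementation: the 17-joint edge list is a hypothetical Human3.6M-style layout, the function names are invented for this example, and treating the hybrid group as the union of the 1..k hop sets is one possible reading of the abstract. The attention step over the groups is not shown.

```python
from collections import deque

# Hypothetical 17-joint skeleton; joint indices and edges are illustrative only.
EDGES = [
    (0, 1), (1, 2), (2, 3),            # one leg
    (0, 4), (4, 5), (5, 6),            # other leg
    (0, 7), (7, 8), (8, 9), (9, 10),   # pelvis, spine, neck, head
    (8, 11), (11, 12), (12, 13),       # one arm
    (8, 14), (14, 15), (15, 16),       # other arm
]


def build_adjacency(edges, num_joints):
    """Undirected adjacency list for the skeleton graph."""
    adj = [[] for _ in range(num_joints)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def hop_groups(adj, source, max_hop):
    """Return {k: sorted joints exactly k hops from source} for k = 1..max_hop."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hop:
            continue  # do not expand beyond the requested hop count
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    groups = {k: [] for k in range(1, max_hop + 1)}
    for joint, d in dist.items():
        if d > 0:
            groups[d].append(joint)
    return {k: sorted(joints) for k, joints in groups.items()}


adj = build_adjacency(EDGES, num_joints=17)
groups = hop_groups(adj, source=8, max_hop=2)

# Assumed definition of the "hybrid group": all neighbors within max_hop hops.
hybrid = sorted(j for joints in groups.values() for j in joints)
```

For joint 8 in this layout, the 1-hop set is `[7, 9, 11, 14]` and the 2-hop set is `[0, 10, 12, 15]`, so the receptive field grows from four joints to eight when the second hop is included.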