HGFreNet: Hop-hybrid GraphFomer for 3D Human Pose Estimation with Trajectory Consistency in Frequency Domain

📅 2025-11-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address depth ambiguity and trajectory discontinuity in monocular video-based 2D-to-3D human pose lifting, this paper proposes HGFreNet, a GraphFormer framework that models global spatial-temporal joint correlations. The method combines three components: a hop-hybrid graph attention (HGA) module that aggregates features over multi-hop skeletal neighborhoods; a trajectory consistency constraint imposed in the frequency domain to capture long-range temporal coherence; and a Transformer encoder paired with a preliminary 3D pose estimation network that supplies 3D information for depth inference across frames. On Human3.6M and MPI-INF-3DHP, the authors report that HGFreNet outperforms state-of-the-art methods in both positional accuracy (MPJPE) and temporal consistency (ACCEL). These results support the use of frequency-domain modeling and multi-hop graph structures for consistent 3D pose estimation.
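For readers unfamiliar with the two metrics, the following is a minimal sketch of how MPJPE and ACCEL are commonly computed. It assumes poses stored as NumPy arrays of shape (frames, joints, 3); the function names and the toy data are illustrative, and the paper's exact evaluation protocol (alignment, units, joint set) may differ.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint.

    pred, gt: arrays of shape (frames, joints, 3) in the same units (e.g. mm).
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def accel_error(pred, gt):
    """Acceleration error: mean distance between second-order finite differences."""
    # x[t-1] - 2*x[t] + x[t+1] along the time axis, for both sequences
    accel_pred = pred[:-2] - 2 * pred[1:-1] + pred[2:]
    accel_gt = gt[:-2] - 2 * gt[1:-1] + gt[2:]
    return float(np.mean(np.linalg.norm(accel_pred - accel_gt, axis=-1)))

# Toy data (assumption): 50 frames, 17 joints, a smooth sinusoidal motion.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)[:, None, None]
gt = 100.0 * np.sin(2 * np.pi * t) * np.ones((1, 17, 3))

offset = gt + 10.0                                   # constant bias, no jitter
jitter = gt + rng.normal(scale=5.0, size=gt.shape)   # frame-wise noise

print("offset:", mpjpe(offset, gt), accel_error(offset, gt))
print("jitter:", mpjpe(jitter, gt), accel_error(jitter, gt))
```

The constant offset has a large position error but essentially zero acceleration error, while the noisy sequence is penalized by ACCEL. This is why the two metrics are reported together: one measures accuracy, the other smoothness.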

📝 Abstract

2D-to-3D human pose lifting is a fundamental challenge for 3D human pose estimation in monocular video, where graph convolutional networks (GCNs) and attention mechanisms have proven to be inherently suitable for encoding the spatial-temporal correlations of skeletal joints. However, depth ambiguity and errors in 2D pose estimation lead to incoherence in the 3D trajectory. Previous studies have attempted to restrict jitters in the time domain, for instance, by constraining the differences between adjacent frames while neglecting the global spatial-temporal correlations of skeletal joint motion. To tackle this problem, we design HGFreNet, a novel GraphFormer architecture with hop-hybrid feature aggregation and 3D trajectory consistency in the frequency domain. Specifically, we propose a hop-hybrid graph attention (HGA) module and a Transformer encoder to model global joint spatial-temporal correlations. The HGA module groups all $k$-hop neighbors of a skeletal joint into a hybrid group to enlarge the receptive field and applies the attention mechanism to discover the latent correlations of these groups globally. We then exploit global temporal correlations by constraining trajectory consistency in the frequency domain. To provide 3D information for depth inference across frames and maintain coherence over time, a preliminary network is applied to estimate the 3D pose. Extensive experiments were conducted on two standard benchmark datasets: Human3.6M and MPI-INF-3DHP. The results demonstrate that the proposed HGFreNet outperforms state-of-the-art (SOTA) methods in terms of positional accuracy and temporal consistency.
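The $k$-hop grouping described in the abstract can be illustrated by enumerating, for each joint, the joints that lie exactly $k$ edges away on the skeleton graph. The sketch below uses a breadth-first search on a hypothetical 6-joint toy skeleton; `hop_distances`, `hop_groups`, and `TOY_EDGES` are names introduced here for illustration. How HGFreNet merges these sets into a hybrid group and applies attention over them is not specified in this summary, so that step is not shown.

```python
from collections import deque

def hop_distances(edges, num_joints):
    """All-pairs hop distance on an undirected skeleton graph via BFS."""
    adj = [[] for _ in range(num_joints)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[-1] * num_joints for _ in range(num_joints)]
    for src in range(num_joints):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] < 0:          # not visited yet
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

def hop_groups(edges, num_joints, max_hop):
    """For each joint, map k -> joints exactly k hops away, for k = 1..max_hop."""
    dist = hop_distances(edges, num_joints)
    return [
        {k: [j for j in range(num_joints) if dist[i][j] == k]
         for k in range(1, max_hop + 1)}
        for i in range(num_joints)
    ]

# Hypothetical toy skeleton (not the Human3.6M layout):
# 0 pelvis, 1 spine, 2 neck, 3 head, 4 left shoulder, 5 right shoulder
TOY_EDGES = [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5)]
groups = hop_groups(TOY_EDGES, num_joints=6, max_hop=3)
print(groups[0])  # {1: [1], 2: [2], 3: [3, 4, 5]}
```

Grouping by hop distance is what enlarges the receptive field: with `max_hop=3`, the pelvis already "sees" the head and both shoulders, which a 1-hop graph convolution would reach only after three layers.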

Problem

Research questions and friction points this paper is trying to address.

Addressing depth ambiguity in 2D-to-3D human pose lifting

Modeling global spatial-temporal correlations of skeletal joints

Ensuring 3D trajectory consistency in the frequency domain
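Depth ambiguity, the first problem listed above, can be shown with a few lines of arithmetic. The sketch assumes an idealized pinhole camera; `project` and its `focal` parameter are hypothetical helpers, not part of the paper.

```python
def project(point3d, focal=1.0):
    """Perspective projection of a camera-space point (x, y, z) onto the image plane."""
    x, y, z = point3d
    return (focal * x / z, focal * y / z)

near = (0.2, -0.1, 2.0)
far = tuple(2.0 * c for c in near)   # same viewing ray, twice as far away

print(project(near), project(far))
```

Both points land on the same 2D location, so a single 2D joint detection cannot determine depth by itself. Lifting methods therefore rely on additional cues such as skeletal structure and motion across frames.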

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hop-hybrid graph attention for global joint correlations

Transformer encoder models spatial-temporal relationships

Frequency-domain constraints ensure trajectory consistency
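One way to picture a frequency-domain trajectory constraint is to transform each joint trajectory along the time axis and compare predicted and ground-truth coefficients. The sketch below assumes an orthonormal DCT-II and an L1 gap over the lowest coefficients; this summary does not state which transform or norm HGFreNet uses, so `dct_matrix`, `frequency_trajectory_loss`, and `num_coeffs` are illustrative choices only.

```python
import numpy as np

def dct_matrix(num_frames):
    """Orthonormal DCT-II basis, shape (num_frames, num_frames); row k is frequency k."""
    n = np.arange(num_frames)
    k = n[:, None]
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * k / (2 * num_frames))
    basis[0] *= np.sqrt(1.0 / num_frames)
    basis[1:] *= np.sqrt(2.0 / num_frames)
    return basis

def frequency_trajectory_loss(pred, gt, num_coeffs):
    """Mean absolute gap between the lowest `num_coeffs` DCT coefficients.

    pred, gt: arrays of shape (frames, joints, 3); the transform runs over frames.
    """
    basis = dct_matrix(pred.shape[0])[:num_coeffs]       # (num_coeffs, frames)
    pred_freq = np.einsum("kt,tjc->kjc", basis, pred)
    gt_freq = np.einsum("kt,tjc->kjc", basis, gt)
    return float(np.mean(np.abs(pred_freq - gt_freq)))

# Toy trajectory (assumption): one joint moving sinusoidally along x.
frames = 32
t = np.arange(frames)
gt = np.zeros((frames, 1, 3))
gt[:, 0, 0] = np.sin(2 * np.pi * t / frames)

jittery = gt.copy()
jittery[:, 0, 0] += 0.05 * (-1.0) ** t    # frame-to-frame alternating jitter

low = frequency_trajectory_loss(jittery, gt, num_coeffs=8)
full = frequency_trajectory_loss(jittery, gt, num_coeffs=frames)
print(low, full)
```

Because each coefficient depends on every frame of the sequence, a loss on coefficients constrains the whole trajectory at once rather than only adjacent-frame differences. In the toy case the alternating jitter concentrates in the high-frequency coefficients, so the low-frequency gap is smaller than the full-spectrum gap.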

🔎 Similar Papers

STGFormer: Spatio-Temporal GraphFormer for 3D Human Pose Estimation in Video