Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation

📅 2026-04-19

🤖 AI Summary
This work addresses a limitation of existing Transformer-based methods for 3D human pose estimation: they model local skeletal structure and inter-channel dependencies poorly, so global and local features are fused insufficiently. The authors propose MixTGFormer, a dual-stream network that integrates Graph Convolutional Networks (GCNs) into the Transformer architecture. Its core component is a spatio-temporal Mixformer module augmented with Squeeze-and-Excitation channel attention, which models local and global relationships jointly and fuses the resulting features. On standard benchmarks, the method reports state-of-the-art P1 errors of 37.6 mm on Human3.6M and 15.7 mm on MPI-INF-3DHP.
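
For context on the metric: on Human3.6M, "P1" conventionally denotes Protocol 1, under which the reported error is the mean per-joint position error (MPJPE), the average Euclidean distance in millimetres between predicted and ground-truth joints. The summary does not define P1, so reading it as MPJPE is an assumption here. The function name `mpjpe` and the toy coordinates below are illustrative only, and the root-joint alignment that evaluation code commonly applies first is omitted.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: lists of frames; each frame is a list of (x, y, z)
    joint positions in millimetres, in the same joint order.
    """
    total, count = 0.0, 0
    for frame_pred, frame_true in zip(pred, target):
        for p, t in zip(frame_pred, frame_true):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    return total / count


# Toy example (made-up numbers): one frame, two joints.
pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
print(mpjpe(pred, target))  # 2.5
```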

📝 Abstract
3D human pose estimation is a classic and important research direction in computer vision. In recent years, Transformer-based methods have made significant progress in lifting 2D poses to 3D. However, these methods focus primarily on modeling global temporal and spatial relationships, neglecting local skeletal relationships and the information interaction between different channels. We therefore propose a novel method, the Dual-stream Spatio-temporal GCN-Transformer Network (MixTGFormer). It models the spatial and temporal relationships of human skeletons simultaneously through two parallel streams, achieving effective fusion of global and local features. The core of MixTGFormer is a stack of Mixformers. Each Mixformer comprises Mixformer Blocks and a Squeeze-and-Excitation Layer (SE Layer). It first extracts and fuses skeletal information through two parallel Mixformer Blocks operating in different modes, then refines the fused information through the SE Layer. The Mixformer Block integrates Graph Convolutional Networks (GCN) into the Transformer, enhancing the use of both local and global information. We implement the block in temporal and spatial forms to extract both kinds of relationships. We extensively evaluated our model on two benchmark datasets, Human3.6M and MPI-INF-3DHP. Compared with other methods, MixTGFormer achieved state-of-the-art results, with P1 errors of 37.6 mm and 15.7 mm on these datasets, respectively.
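
The block structure described in the abstract can be sketched roughly as follows. This is a minimal, weight-free illustration and not the authors' implementation. The abstract does not say how the GCN is wired into the Transformer, so the additive fusion of a graph-convolution branch and a self-attention branch is an assumption, as are all function names (`graph_branch`, `attention_branch`, `se_gate`, `mix_block`), the 3-joint toy skeleton, and the omission of learned projections. Only the spatial form is shown; the temporal form would apply the same pattern along the frame axis.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [v / s for v in exps]


def normalized_adjacency(edges, n):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def graph_branch(x, adj):
    """Local branch: each joint mixes its own features with its neighbours'."""
    return matmul(adj, x)


def attention_branch(x):
    """Global branch: single-head self-attention with identity projections."""
    scale = math.sqrt(len(x[0]))
    scores = [[sum(q_c * k_c for q_c, k_c in zip(q, k)) / scale for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    return matmul(weights, x)


def se_gate(x):
    """Simplified squeeze-and-excitation: average each channel over joints,
    then rescale it by a sigmoid gate. A full SE layer would place two
    learned fully connected layers before the sigmoid."""
    n = len(x)
    squeezed = [sum(col) / n for col in zip(*x)]
    gates = [1.0 / (1.0 + math.exp(-s)) for s in squeezed]
    return [[v * g for v, g in zip(row, gates)] for row in x]


def mix_block(x, adj):
    """Fuse local and global branches by addition (assumed), then gate channels."""
    local = graph_branch(x, adj)
    glob = attention_branch(x)
    fused = [[l + g for l, g in zip(lr, gr)] for lr, gr in zip(local, glob)]
    return se_gate(fused)


# Toy skeleton: 3 joints in a chain, 2 feature channels per joint.
adj = normalized_adjacency([(0, 1), (1, 2)], 3)
features = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0]]
out = mix_block(features, adj)
print(len(out), len(out[0]))  # 3 2
```

The graph branch only sees joints connected in the skeleton, while the attention branch lets every joint attend to every other one, which is the local versus global distinction the abstract draws.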
Problem

Research questions and friction points this paper is trying to address.

3D human pose estimation
local skeletal relationships
information interaction
global temporal and spatial relationships
channel interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream
GCN-Transformer
Spatio-Temporal Modeling
Mixformer
3D Human Pose Estimation
Jiawen Duan
School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Liuxia, Hangzhou 310023, China
Jian Xiang
UNC Charlotte
Formal methods for Security, Information-flow analysis, Cyber-Physical System
Zhiqiang Li
University of Nebraska-Lincoln
Linlin Xue
School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Liuxia, Hangzhou 310023, China
Wan Xiang
School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Liuxia, Hangzhou 310023, China