PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation

📅 2025-12-18
🤖 AI Summary
Existing lifting methods encode 2D pose and depth in an entangled feature space, so depth uncertainty corrupts the 2D features and limits 3D pose estimation accuracy. To address this, we propose a stage-wise decoupled encoding framework: (1) a Mixture-of-Experts (MoE) architecture with expert specialization explicitly separates 2D pose refinement from depth estimation; (2) a cross-expert knowledge aggregation module enables bidirectional spatiotemporal contextual enhancement between 2D and depth representations. This work is the first to jointly unify feature disentanglement, expert collaboration, and temporal modeling within monocular 3D human pose estimation. Extensive experiments demonstrate state-of-the-art performance on Human3.6M, MPI-INF-3DHP, and 3DPW, outperforming prevailing lifting methods in both accuracy and robustness.

📝 Abstract
Lifting-based methods have dominated monocular 3D human pose estimation by using detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, which explicitly introduces depth uncertainty into the detected 2D pose and thereby limits overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth has first been refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network in which specialized expert modules refine the well-detected 2D pose features and learn the depth features. This design disentangles the feature encoding process for 2D pose and depth, reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module that aggregates cross-expert spatio-temporal contextual information and enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that PoseMoE outperforms conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
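The entanglement argument in the abstract can be illustrated with a toy comparison. The sketch below is not the paper's encoder: the linear embeddings, the feature width `D`, and the names `encode_entangled` and `encode_decoupled` are assumptions made for illustration. It only shows that a joint embedding over per-joint (x, y, z) lets an arbitrary depth guess change every feature, while separate embeddings leave the 2D pose features untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
J, D = 17, 8  # number of joints and feature width (illustrative values)

pose_2d = rng.normal(size=(J, 2))   # detected 2D keypoints (random stand-in)
depth_a = np.zeros((J, 1))          # one placeholder for the unknown depth
depth_b = rng.normal(size=(J, 1))   # a different, equally arbitrary placeholder

# Entangled encoding: a single linear embedding over [x, y, z] per joint.
W_joint = rng.normal(size=(3, D))

def encode_entangled(p2d, z):
    return np.concatenate([p2d, z], axis=-1) @ W_joint

# Decoupled encoding: separate embeddings for 2D pose and for depth.
W_2d = rng.normal(size=(2, D))
W_z = rng.normal(size=(1, D))

def encode_decoupled(p2d, z):
    return p2d @ W_2d, z @ W_z

# How much do the features move when only the depth guess changes?
ent_shift = np.abs(
    encode_entangled(pose_2d, depth_a) - encode_entangled(pose_2d, depth_b)
).max()
f2d_a, _ = encode_decoupled(pose_2d, depth_a)
f2d_b, _ = encode_decoupled(pose_2d, depth_b)
dec_shift = np.abs(f2d_a - f2d_b).max()

print("entangled feature shift:", ent_shift)    # nonzero: depth guess leaks in
print("decoupled 2D feature shift:", dec_shift)  # zero: 2D features unaffected
```

In the entangled case the shift is nonzero because the depth column multiplies into every feature dimension; in the decoupled case the 2D features never see the depth placeholder at all.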
Problem

Research questions and friction points this paper is trying to address.

Disentangles 2D pose and depth feature encoding to reduce uncertainty
Refines depth features separately before combining with 2D pose information
Improves monocular 3D human pose estimation accuracy via expert modules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-experts network disentangles 2D pose and depth features
Cross-expert aggregation module enhances spatio-temporal contextual information
Specialized experts refine 2D pose and learn depth features separately
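The three points above can be sketched as a two-stage forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the experts are stand-in residual MLPs, the aggregation is plain single-head cross-attention over all frame-joint tokens, the depth expert is assumed to take the detected 2D pose as its input, and every name (`expert`, `cross_attend`, `pose_moe_sketch`, `init_params`) and size is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def expert(x, w1, w2):
    """Residual two-layer MLP, a stand-in for a specialized expert."""
    return x + np.maximum(x @ w1, 0.0) @ w2

def cross_attend(query, context):
    """Single-head cross-attention: query tokens gather context from the other branch."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    return query + softmax(scores) @ context

def init_params(rng, d=16, scale=0.1):
    shapes = {
        "embed_2d": (2, d), "w1_2d": (d, d), "w2_2d": (d, d), "head_2d": (d, 2),
        "embed_z": (2, d), "w1_z": (d, d), "w2_z": (d, d), "head_z": (d, 1),
    }
    return {k: rng.normal(scale=scale, size=s) for k, s in shapes.items()}

def pose_moe_sketch(pose_2d, params):
    """pose_2d: (T, J, 2) detected keypoints -> (T, J, 3) pose estimate."""
    T, J, _ = pose_2d.shape
    tokens = pose_2d.reshape(T * J, 2)  # one token per (frame, joint)

    # Stage 1: decoupled encoding, each expert sees only its own feature stream.
    f2d = expert(tokens @ params["embed_2d"], params["w1_2d"], params["w2_2d"])
    fz = expert(tokens @ params["embed_z"], params["w1_z"], params["w2_z"])

    # Stage 2: bidirectional aggregation across experts, over all frames and joints.
    f2d_out = cross_attend(f2d, fz)
    fz_out = cross_attend(fz, f2d)

    xy = f2d_out @ params["head_2d"]  # refined 2D component
    z = fz_out @ params["head_z"]     # estimated depth component
    return np.concatenate([xy, z], axis=-1).reshape(T, J, 3)

rng = np.random.default_rng(0)
pose_2d = rng.normal(size=(4, 17, 2))  # 4 frames, 17 joints (random stand-in)
pose_3d = pose_moe_sketch(pose_2d, init_params(rng))
print(pose_3d.shape)
```

The ordering is the point of the sketch: the two feature streams only exchange information after each expert has produced its own representation, which mirrors the abstract's claim that joint encoding helps once depth is no longer in a completely unknown state.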
Mengyuan Liu
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
Jiajie Liu
Peking University
Computer Vision
Jinyan Zhang
State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
Wenhao Li
College of Computing and Data Science, Nanyang Technological University, Singapore 639798
Junsong Yuan
State University of New York at Buffalo
computer vision, video analytics, action and gesture analysis, multimedia, pattern recognition