PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge

📅 2025-05-30
🤖 AI Summary
This paper addresses the joint estimation of 21-joint 3D hand pose, full-body pose, and demonstrator proficiency from first-person RGB videos in the EgoExo4D dataset, tasks complicated by fine-grained motions, frequent occlusions, and dynamic scenes. The authors propose the Hand Pose Vision Transformer (HP-ViT+), which combines a Vision Transformer and a CNN backbone through weighted fusion, and adopt a multimodal spatio-temporal feature integration strategy for body pose estimation. A Transformer-based architecture is further applied to proficiency classification. The method wins both CVPR 2025 pose challenges, with 8.31 mm PA-MPJPE for hand pose and 11.25 mm MPJPE for body pose, and attains a state-of-the-art top-1 proficiency classification accuracy of 0.53.

📝 Abstract
This report introduces our team's (PCIE_EgoPose) solutions for the EgoExo4D Pose and Proficiency Estimation Challenges at CVPR 2025. Focused on the intricate task of estimating 21 3D hand joints from RGB egocentric videos, which is complicated by subtle movements and frequent occlusions, we developed the Hand Pose Vision Transformer (HP-ViT+). This architecture synergizes a Vision Transformer and a CNN backbone, using weighted fusion to refine the hand pose predictions. For the EgoExo4D Body Pose Challenge, we adopted a multimodal spatio-temporal feature integration strategy to address the complexities of body pose estimation across dynamic contexts. Our methods achieved remarkable performance: 8.31 PA-MPJPE in the Hand Pose Challenge and 11.25 MPJPE in the Body Pose Challenge, securing championship titles in both competitions. We extended our pose estimation solutions to the Proficiency Estimation task, applying core technologies such as transformer-based architectures. This extension enabled us to achieve a top-1 accuracy of 0.53, a state-of-the-art result, in the Demonstrator Proficiency Estimation competition.
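The reported hand-pose number uses PA-MPJPE, i.e. mean per-joint position error after Procrustes (similarity-transform) alignment of the prediction to the ground truth. This is a standard metric independent of the paper's method; a minimal sketch of how it is typically computed for a 21-joint hand might look like:

```python
import numpy as np

def procrustes_align(pred, gt):
    """Align pred onto gt with the optimal similarity transform
    (scale, rotation, translation). pred, gt: (J, 3) joint arrays."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the cross-covariance matrix.
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    # Flip the last singular vector if R is a reflection.
    if np.linalg.det(R) < 0:
        Vt[-1] *= -1
        s[-1] *= -1
        R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """Mean per-joint position error after Procrustes alignment
    (same units as the input, e.g. mm)."""
    aligned = procrustes_align(pred, gt)
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE measures only the articulated pose error, which is why it is preferred for egocentric hand evaluation where camera pose varies.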
Problem

Research questions and friction points this paper is trying to address.

Estimating 21 3D hand joints from RGB egocentric videos
Addressing body pose estimation in dynamic contexts
Applying pose estimation to demonstrator proficiency evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

HP-ViT+ combines Vision Transformer and CNN
Multimodal spatio-temporal feature integration for body pose
Transformer-based architectures for proficiency estimation
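The page describes HP-ViT+ as combining ViT and CNN branches via weighted fusion but does not specify the mechanism. A minimal keypoint-level sketch, assuming a single scalar weight `w_vit` (hypothetical name and value, not from the report), could look like:

```python
import numpy as np

def fuse_predictions(vit_joints, cnn_joints, w_vit=0.6):
    """Weighted fusion of 3D joint predictions from two branches.

    vit_joints, cnn_joints: (21, 3) arrays from the ViT and CNN backbones.
    w_vit: fusion weight for the ViT branch (hypothetical value; the
    report does not publish its actual fusion weights).
    """
    return w_vit * vit_joints + (1.0 - w_vit) * cnn_joints
```

In practice such weights are usually tuned on a validation split, or learned per joint; this sketch only illustrates the convex-combination form of the fusion.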
Feng Chen
Lenovo Research
Kanokphan Lertniphonphan
Lenovo Research
Qiancheng Yan
University of Chinese Academy of Sciences
Xiaohui Fan
Tsinghua University
Jun Xie
Lenovo Research
Tao Zhang
Tsinghua University
Zhepeng Wang
Applied Scientist at Amazon Stores Foundational AI