3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB Cameras

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular 3D pose estimation yields camera-centered skeletal representations that are strongly viewpoint dependent, hindering cross-view kinematic analysis in health and sports science. To address this, we propose 3DPCNet, a model-agnostic pose canonicalization module whose hybrid encoder integrates graph convolutions (encoding local bone topology) with Transformer-based global context modeling through gated cross-attention. From this representation, the module is trained in a self-supervised manner to predict a continuous 6D rotation that aligns each pose to a body-centered canonical frame in SO(3), requiring only synthetic rotational augmentations and a composite loss rather than ground-truth orientation annotations. On MM-Fi, it reduces the mean rotation error from over 20° to 3.4° and the mean per-joint position error from 64 mm to 47 mm. On TotalCapture, the canonicalized poses yield acceleration signals that closely match ground-truth IMU measurements, improving the physical plausibility and cross-view comparability of kinematic analysis.
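The continuous 6D rotation mentioned above is the representation introduced by Zhou et al. (CVPR 2019): the network regresses the first two columns of a rotation matrix, which are then orthonormalized into a full SO(3) matrix by Gram-Schmidt. A minimal PyTorch sketch of that standard mapping (the function name is illustrative; this is not the authors' released code):

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Map a 6D rotation representation to an SO(3) matrix via
    Gram-Schmidt orthonormalization (Zhou et al., CVPR 2019).

    d6: (..., 6) tensor holding the (possibly unnormalized) first two
        columns of a rotation matrix, as produced by a regression head.
    Returns: (..., 3, 3) proper rotation matrices.
    """
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)                       # first basis vector
    # Remove the component of a2 along b1, then normalize.
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)                   # right-handed third axis
    return torch.stack((b1, b2, b3), dim=-1)           # basis vectors as columns
```

This representation is continuous over SO(3), which is why regression heads commonly prefer it to Euler angles or quaternions.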

📝 Abstract
Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from approximately 64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.
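Given the abstract's description of self-supervised training on synthetically rotated poses with a composite rotation-plus-reconstruction loss, one training step can be sketched as below. The rotation sampler, the squared-error terms, and the weights `w_rot`/`w_pose` are illustrative assumptions; the paper's exact loss may differ.

```python
import torch

def random_rotations(n: int) -> torch.Tensor:
    """Sample n random rotation matrices (QR-based, approximately uniform)."""
    q, r = torch.linalg.qr(torch.randn(n, 3, 3))
    d = torch.diagonal(r, dim1=-2, dim2=-1).sign()
    q = q * d.unsqueeze(-2)                    # sign-fix the QR decomposition
    q[torch.linalg.det(q) < 0, :, 0] *= -1.0   # force det = +1 (proper rotations)
    return q

def training_step(model, canonical_pose, w_rot=1.0, w_pose=1.0):
    """One self-supervised step: rotate a canonical pose, predict the rotation.

    canonical_pose: (B, J, 3) root-centered joints in the canonical frame.
    model: maps a rotated pose to a predicted rotation matrix (B, 3, 3),
           e.g. through a 6D head followed by Gram-Schmidt orthonormalization.
    """
    B = canonical_pose.shape[0]
    R = random_rotations(B)                                    # synthetic viewpoint
    rotated = torch.einsum('bij,bkj->bki', R, canonical_pose)  # apply R per joint
    R_pred = model(rotated)
    # Undo the predicted rotation (apply R_pred^T) to reconstruct the pose.
    recon = torch.einsum('bji,bkj->bki', R_pred, rotated)
    loss_rot = (R_pred - R).pow(2).mean()               # rotation accuracy term
    loss_pose = (recon - canonical_pose).pow(2).mean()  # reconstruction term
    return w_rot * loss_rot + w_pose * loss_pose
```

Because the supervising rotation is generated on the fly, no manually annotated canonical orientations are needed, which matches the self-supervised setup described above.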
Problem

Research questions and friction points this paper is trying to address.

Correcting view-dependent 3D poses into a consistent canonical frame
Enabling robust viewpoint-invariant 3D kinematic analysis from monocular RGB
Removing viewpoint variability for physically plausible motion analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose canonicalization via body-centered frame alignment
Hybrid encoder fuses graph and transformer features via gated cross-attention (see the sketch after this list)
Self-supervised training with composite loss function
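A rough sketch of the gated cross-attention fusion named above, assuming per-joint feature sequences from the GCN and transformer branches; the layer sizes, sigmoid gate, and residual form are illustrative assumptions rather than the authors' architecture:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Fuse local GCN features (bone topology) with global transformer
    context via cross-attention, modulated by a learned gate."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, gcn_feats: torch.Tensor, tr_feats: torch.Tensor):
        # gcn_feats, tr_feats: (B, J, dim) per-joint feature sequences.
        # Local skeletal features query the transformer's global context.
        ctx, _ = self.attn(query=gcn_feats, key=tr_feats, value=tr_feats)
        # The gate decides, per joint and channel, how much global context
        # to blend into the local features.
        g = self.gate(torch.cat([gcn_feats, ctx], dim=-1))
        return self.norm(gcn_feats + g * ctx)
```

One plausible motivation for gating over plain additive fusion is that the model can fall back on purely local bone geometry when global context is uninformative.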
Tharindu Ekanayake
Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, Finland
Constantino Álvarez Casado
Postdoctoral Researcher, University of Oulu
Computer Vision · Machine Learning · Deep Learning · Human Sensing · Digital Signal Processing
Miguel Bordallo López
Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, Finland