🤖 AI Summary
Monocular 3D pose estimation yields camera-centered skeletal representations whose strong viewpoint dependency hinders cross-view kinematic analysis in health and sports science. To address this, we propose 3DPCNet, a model-agnostic pose normalization module that fuses graph convolution (encoding local bone topology) with Transformer-based global context modeling through gated cross-attention, and predicts a continuous 6D rotation that maps each input to an SO(3)-aligned, body-centered canonical pose. Training is self-supervised, requiring only synthetic rotational augmentations and a composite loss rather than ground-truth 3D annotations. On MM-Fi, the method reduces mean rotation error from over 20° to 3.4° and mean per-joint position error from 64 mm to 47 mm. On TotalCapture, normalized poses yield acceleration signals highly consistent with ground-truth IMU measurements, markedly improving the physical plausibility and cross-view comparability of kinematic analysis.
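For concreteness, the continuous 6D rotation output mentioned above is conventionally decoded into a valid $SO(3)$ matrix by Gram-Schmidt orthonormalization of two predicted 3-vectors. The PyTorch sketch below shows that standard mapping; since the summary gives no implementation details, assuming this common construction (and the 17-joint pose shape in the usage example) is ours, not the authors'.

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    # d6: (..., 6) raw network output; the two 3-vectors are orthonormalized
    # into the first two rows of the rotation matrix via Gram-Schmidt.
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)  # completes a right-handed orthonormal basis
    return torch.stack((b1, b2, b3), dim=-2)  # (..., 3, 3), det = +1

# Usage: canonicalize a batch of hypothetical 17-joint poses.
rot6d = torch.randn(8, 6)                       # stand-in for network output
R = rotation_6d_to_matrix(rot6d)                # (8, 3, 3)
poses = torch.randn(8, 17, 3)                   # camera-centered joints
canonical = torch.einsum('bij,bkj->bki', R, poses)
```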
📝 Abstract
Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss that enforces both accurate rotation prediction and faithful pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from approximately 64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that the module removes viewpoint variability and enables physically plausible motion analysis.
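The hybrid encoder is described only at a high level, so the following PyTorch sketch illustrates one plausible shape of the gated cross-attention fusion: per-joint GCN features query the transformer's global context, and a learned gate controls how much of that context is mixed back in. The class name, feature dimension, head count, and gated-residual form are all illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    # Hypothetical fusion block: local (GCN) features attend over global
    # (transformer) features; a sigmoid gate modulates the attended context.
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, gcn_feats: torch.Tensor, tr_feats: torch.Tensor) -> torch.Tensor:
        # gcn_feats, tr_feats: (batch, joints, dim) per-joint feature sequences.
        ctx, _ = self.attn(query=gcn_feats, key=tr_feats, value=tr_feats)
        g = self.gate(torch.cat([gcn_feats, ctx], dim=-1))  # per-joint, per-channel gate
        return self.norm(gcn_feats + g * ctx)               # gated residual fusion
```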
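Likewise, the self-supervised recipe (rotate a canonical pose by a known random rotation, predict that rotation, and reconstruct the input) can be sketched as a single training-step loss. The geodesic rotation term, the joint-wise reconstruction term, and their equal weighting are assumptions about the composite loss; `rotation_6d_to_matrix` is the helper from the earlier sketch, and `model` is any hypothetical network mapping a (batch, joints, 3) pose to a 6D rotation.

```python
import torch

def random_rotation(batch: int) -> torch.Tensor:
    # Random proper rotations via QR decomposition of Gaussian matrices.
    q, _ = torch.linalg.qr(torch.randn(batch, 3, 3))
    q[:, :, 0] *= torch.linalg.det(q).sign().unsqueeze(-1)  # force det = +1
    return q

def composite_loss(model, canonical_pose: torch.Tensor) -> torch.Tensor:
    # canonical_pose: (batch, joints, 3) body-centered poses.
    R_gt = random_rotation(canonical_pose.shape[0])
    rotated = torch.einsum('bij,bkj->bki', R_gt, canonical_pose)  # synthetic view
    R_pred = rotation_6d_to_matrix(model(rotated))  # helper from the sketch above
    # Geodesic rotation error: angle of R_pred^T @ R_gt.
    rel = torch.einsum('bji,bjk->bik', R_pred, R_gt)
    tr = rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    rot_loss = torch.acos(((tr - 1) / 2).clamp(-1 + 1e-6, 1 - 1e-6)).mean()
    # Reconstruction: un-rotating with the prediction should recover the input.
    recon = torch.einsum('bji,bkj->bki', R_pred, rotated)  # applies R_pred^T
    pose_loss = (recon - canonical_pose).norm(dim=-1).mean()
    return rot_loss + pose_loss  # equal weighting is an assumption
```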