AI Summary
Existing parametric human models (e.g., SMPL) lack biomechanical fidelity. SKEL provides an anatomically accurate skeletal structure, but estimating its parameters is hindered by limited training data, perspective ambiguities, and the complexity of human articulation. To address these challenges, we propose SKEL-CF, a Transformer-based coarse-to-fine encoder-decoder framework in which the encoder produces initial estimates of camera and SKEL parameters and the decoder refines them layer by layer. We explicitly model camera geometry to mitigate depth and scale ambiguities. Furthermore, we introduce 4DHuman-SKEL, a SKEL-aligned conversion of the SMPL-based 4DHuman dataset that provides anatomically consistent training data. Evaluated on MOYO, SKEL-CF achieves 85.0 mm MPJPE and 51.4 mm PA-MPJPE, substantially outperforming HSMR (104.5 mm / 79.6 mm), the previous SKEL-based state of the art.
Abstract
Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 mm MPJPE / 51.4 mm PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 mm / 79.6 mm). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
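For readers unfamiliar with the two reported metrics, the following is a minimal NumPy sketch of how MPJPE and PA-MPJPE are commonly defined. It is not taken from the SKEL-CF code. The function names, the `(J, 3)` array layout, and the omission of root-joint alignment before MPJPE are assumptions made for illustration, and the paper's exact evaluation protocol (joint set, alignment steps) may differ.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, gt: (J, 3) arrays of 3D joint positions in the same unit (e.g. mm).
    Assumption: no root-joint alignment is applied here.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (optimal scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Cross-covariance between centered point sets, decomposed by SVD.
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: MPJPE after similarity alignment to gt."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because PA-MPJPE removes global scale, rotation, and translation before measuring error, it isolates articulated pose accuracy, whereas MPJPE also penalizes global placement errors. This is why both numbers are typically reported together.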