🤖 AI Summary
Monocular human mesh recovery (HMR) suffers from scale and depth ambiguities, which hinder reconstruction of geometrically plausible 3D meshes with metric-scale (meter-level) dimensions and accurate global translation. This work systematically analyzes the camera models used by previous HMR methods and argues that the standard perspective projection model is critical for metric-accurate HMR. It proposes an end-to-end framework built on a ray map, a representation derived from the standard perspective projection that jointly encodes the 2D bounding box, camera parameters, and geometric cues, which removes the need for additional metric-regularization modules. The method regresses human meshes at true metric scale together with their global translation, across both indoor and in-the-wild scenes. Extensive experiments report state-of-the-art performance in metric pose, shape, and global translation estimation, even when compared with sequential HMR approaches.
📝 Abstract
We introduce MetricHMR (Metric Human Mesh Recovery), an approach for metric human mesh recovery with accurate global translation from monocular images. In contrast to existing HMR methods, which suffer from severe scale and depth ambiguity, MetricHMR produces geometrically reasonable body shape and global translation in its reconstruction results. To this end, we first systematically analyze the camera models used by previous HMR methods to emphasize the critical role of the standard perspective projection model in enabling metric-scale HMR. We then validate the acceptable ambiguity range of metric HMR under the standard perspective projection model. Finally, we contribute a novel approach that introduces a ray map based on the standard perspective projection to jointly encode bounding-box information, camera parameters, and geometric cues for end-to-end metric HMR without any additional metric-regularization modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance, even compared with sequential HMR methods, in metric pose, shape, and global translation estimation across both indoor and in-the-wild scenarios.
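To make the ray-map idea more concrete, the following is a minimal sketch of how per-pixel rays inside a bounding box could be computed under a standard pinhole (perspective) camera. It is an illustration only, not the MetricHMR implementation: the function name `crop_ray_map`, the grid size, and the intrinsics values are all assumptions, and the abstract does not specify how the actual ray map is built or which additional geometric cues it carries.

```python
import numpy as np


def crop_ray_map(K, bbox, out_size=(5, 5)):
    """Unit ray directions for a grid of pixels sampled inside a bounding box.

    Hypothetical helper for illustration; not taken from the paper.

    K        : 3x3 pinhole intrinsics of the full image (assumed known).
    bbox     : (x_min, y_min, x_max, y_max) in full-image pixel coordinates.
    out_size : (height, width) of the returned ray map.
    """
    x0, y0, x1, y1 = bbox
    h, w = out_size

    # Centers of the grid cells, expressed in full-image pixel coordinates,
    # so the bounding-box position and size are baked into the rays.
    us = x0 + (np.arange(w) + 0.5) * (x1 - x0) / w
    vs = y0 + (np.arange(h) + 0.5) * (y1 - y0) / h
    u, v = np.meshgrid(us, vs)  # each has shape (h, w)

    # Homogeneous pixel coordinates [u, v, 1] for every grid cell.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (h, w, 3)

    # Back-project through the perspective model: ray = K^{-1} [u, v, 1]^T.
    rays = pix @ np.linalg.inv(K).T

    # Normalize to unit length so each entry is a pure direction.
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    return rays


# Example intrinsics (assumed values): focal length 1000 px, principal
# point at (320, 240).
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])

centered = crop_ray_map(K, (220, 140, 420, 340))  # box around the principal point
shifted = crop_ray_map(K, (420, 140, 620, 340))   # same size, moved to the right
```

In this sketch, two boxes of identical size yield different ray maps when they sit at different image locations, because each ray depends on both the intrinsics and the full-image pixel position. That is the kind of information a crop-only, weak-perspective input discards, and it illustrates why a perspective-based ray map can carry bounding-box and camera information together.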