🤖 AI Summary
Problem: Multi-person mesh recovery from a single image is limited by the scarcity of in-the-wild training data. Existing pseudo-ground-truth pipelines process each person independently, which produces conflicting depths and scales within the same image.
Method: This paper proposes Depth-conditioned Translation Optimization (DTO), an optimization-based framework that jointly refines the camera-space translations of all people in an image under depth constraints, yielding a scene-consistent 3D layout. Built on a maximum a posteriori (MAP) formulation, DTO combines anthropometric priors on human height with monocular depth cues to optimize the pseudo-labels of all persons together, described as the first such approach for multi-person settings.
Contributions/Results: Applying DTO to the 4D-Humans dataset yields DTO-Humans, a large-scale, scene-consistent dataset of 560K images with an average of 4.8 persons per image. The paper also proposes Metric-Aware HMR, a network that estimates metric-scale human meshes and camera parameters end to end. The method achieves state-of-the-art performance on both multi-person mesh recovery and relative depth reasoning.
📝 Abstract
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric: each human is processed individually, without joint optimization. This leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human meshes and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
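To make the idea of a MAP estimate that combines a height prior with monocular depth cues more concrete, here is a minimal toy sketch. It is not the paper's objective: it assumes a pinhole camera, a Gaussian prior on standing height, Gaussian noise on depth, and a relative depth map that differs from metric depth by a single unknown global scale. The function name `solve_depths_map` and all parameters (`mu_h`, `sigma_h`, `sigma_d`) are hypothetical and chosen only for illustration.

```python
def solve_depths_map(pixel_heights, rel_depths, focal,
                     mu_h=1.7, sigma_h=0.1, sigma_d=0.2):
    """Toy MAP estimate of per-person metric depth (illustrative assumptions only).

    Minimizes, over depths z_i and one shared scale s:
        sum_i (z_i * h_i / focal - mu_h)^2 / sigma_h^2   # height prior
      + sum_i (z_i - s * d_i)^2 / sigma_d^2              # depth-cue likelihood
    where h_i is a person's height in pixels and d_i their relative depth.
    """
    if not pixel_heights or len(pixel_heights) != len(rel_depths):
        raise ValueError("need equally sized, non-empty inputs")

    w_h = 1.0 / sigma_h ** 2          # precision of the height prior
    w_d = 1.0 / sigma_d ** 2          # precision of the depth cue
    k = [h / focal for h in pixel_heights]   # metric height = k_i * z_i

    # Eliminating each z_i analytically leaves a 1-D quadratic in the scale s:
    #   sum_i c_i * (k_i * s * d_i - mu_h)^2
    c = [w_h * w_d / (w_h * ki * ki + w_d) for ki in k]
    num = sum(ci * ki * di * mu_h for ci, ki, di in zip(c, k, rel_depths))
    den = sum(ci * (ki * di) ** 2 for ci, ki, di in zip(c, k, rel_depths))
    scale = num / den

    # With s fixed, each z_i is a precision-weighted blend of the two cues.
    depths = [
        (w_h * ki * mu_h + w_d * scale * di) / (w_h * ki * ki + w_d)
        for ki, di in zip(k, rel_depths)
    ]
    return depths, scale


if __name__ == "__main__":
    # Three people of identical height at different distances (synthetic data).
    depths, scale = solve_depths_map([850.0, 425.0, 212.5], [0.4, 0.8, 1.6],
                                     focal=1000.0)
    print([round(z, 2) for z in depths], round(scale, 2))
```

Because every person shares the same scale variable, the height prior of one person constrains the depth of all others, which is the kind of scene-level coupling that per-person pipelines lack. A real system would additionally need robust losses, occlusion handling, and full 3D translations rather than depth alone.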