Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

📅 2025-11-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Single-image multi-person mesh recovery suffers from depth and scale inconsistency because real-world training data with accurate 3D scene geometry is scarce. Method: This paper proposes Depth-conditioned Translation Optimization (DTO), a framework that jointly optimizes the camera-space translations of all people in an image under depth constraints, yielding a scene-consistent 3D layout. Built on a maximum a posteriori (MAP) formulation, DTO combines anthropometric priors with monocular depth cues to refine the pseudo-labels of every person in an image together, which is described as the first such approach for multi-person settings. Contributions/Results: The paper introduces DTO-Humans, a large-scale, scene-consistent dataset of 560K images with an average of 4.8 persons per image. It also presents Metric-Aware HMR, a network that estimates metric-scale human meshes and camera parameters end to end. The method is reported to achieve state-of-the-art performance on both multi-person mesh recovery and relative depth inference.

📝 Abstract

Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric: each human is processed individually, without joint optimization. This leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
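The abstract does not spell out the DTO objective, so the following is only a toy sketch of how a MAP estimate can combine a height prior with monocular depth cues. Everything specific here is an assumption made for illustration: a pinhole camera with focal length `focal_px`, a Gaussian prior on body height (`mu_h`, `sigma_h`), a depth estimator whose output `rel_depths` matches metric depth up to one unknown global `scale`, Gaussian depth noise `sigma_d`, and the hypothetical function name `solve_depths`. Under these assumptions the negative log-posterior is quadratic, so alternating closed-form updates suffice.

```python
def solve_depths(pixel_heights, rel_depths, focal_px,
                 mu_h=1.7, sigma_h=0.1, sigma_d=0.3, iters=200):
    """Toy joint MAP estimate of per-person metric depth (illustrative only).

    Assumed model (not taken from the paper):
      depth_i  = focal_px * height_i / pixel_height_i   (pinhole projection)
      height_i ~ Normal(mu_h, sigma_h^2)                 (anthropometric prior)
      depth_i  ~ Normal(scale * rel_depth_i, sigma_d^2)  (monocular depth cue)
    The unknowns are every height_i and the single shared scale.
    """
    # a_i converts one metre of body height into metres of depth for person i.
    a = [focal_px / h for h in pixel_heights]
    heights = [mu_h] * len(a)
    scale = 1.0
    sum_d2 = sum(d * d for d in rel_depths)

    for _ in range(iters):
        # Closed-form update of the shared scale with heights held fixed.
        scale = sum(d * ai * hi
                    for d, ai, hi in zip(rel_depths, a, heights)) / sum_d2
        # Closed-form update of each height with the scale held fixed.
        heights = [
            (mu_h / sigma_h ** 2 + ai * scale * d / sigma_d ** 2)
            / (1.0 / sigma_h ** 2 + ai ** 2 / sigma_d ** 2)
            for ai, d in zip(a, rel_depths)
        ]

    depths = [ai * hi for ai, hi in zip(a, heights)]
    return depths, heights, scale


if __name__ == "__main__":
    # Three people whose boxes are 850, 425 and 212.5 pixels tall.
    depths, heights, scale = solve_depths(
        pixel_heights=[850.0, 425.0, 212.5],
        rel_depths=[1.0, 2.0, 4.0],
        focal_px=1000.0,
    )
    print(depths, heights, scale)
```

Because the scale is shared by everyone in the image, a person whose apparent size disagrees with the depth map is pulled toward a placement consistent with the rest of the crowd, which is the intuition behind optimizing all subjects jointly rather than one at a time.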

Problem

Research questions and friction points this paper is trying to address.

Addressing scene-level inconsistency in multi-person mesh recovery from single images

Resolving conflicting depths and scales among individuals within crowded scenes

Overcoming scarcity of in-the-wild training data for multi-person mesh reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly optimizes the camera-space translations of all people in a crowd

Uses anthropometric priors and depth cues

Enforces metric scale with relative metric loss
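The form of the relative metric loss is not given on this page. As a rough illustration of what a loss on relative scales could look like, the sketch below compares pairwise log-ratios of predicted and reference depths; it is a generic scale-invariant pairwise penalty, not the paper's loss, and the function name `relative_log_ratio_loss` is hypothetical.

```python
import math
from itertools import combinations


def relative_log_ratio_loss(pred, target):
    """Mean absolute error between pairwise log-ratios (illustrative only).

    pred, target: positive per-person quantities, e.g. depths in metres.
    The value is zero whenever pred equals target up to one global factor,
    so only the relative arrangement of people is penalized.
    """
    pairs = list(combinations(range(len(pred)), 2))
    if not pairs:
        return 0.0  # a single person has no relative scale to compare
    total = 0.0
    for i, j in pairs:
        pred_ratio = math.log(pred[i] / pred[j])
        target_ratio = math.log(target[i] / target[j])
        total += abs(pred_ratio - target_ratio)
    return total / len(pairs)
```

A loss of this kind ignores absolute scale by design, so in practice it would be paired with a term that anchors metric scale, such as supervision on camera parameters.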
