Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Single-image multi-person mesh recovery suffers from depth and scale inconsistency because real-world training data with accurate 3D scene geometry is scarce. Method: This paper proposes Depth-Conditioned Translation Optimization (DTO), a framework that jointly optimizes the camera-space translations of all people in an image under depth constraints, yielding a scene-consistent 3D layout. Built on a maximum a posteriori (MAP) formulation, DTO combines anthropometric priors with monocular depth cues to refine the pseudo-labels of every person in an image together, and is presented as the first such approach for multi-person settings. Contributions/Results: The paper introduces DTO-Humans, a large-scale, scene-consistent dataset of 560K images averaging 4.8 persons per image, and Metric-Aware HMR, a network that estimates metric-scale human meshes and camera parameters end to end. The method reports state-of-the-art performance on both multi-person mesh recovery and relative depth inference.

📝 Abstract
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
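
As a rough illustration of the MAP idea in the abstract, the following Python sketch solves a toy, depth-only version of the joint placement problem. It is not the authors' implementation: the function name `solve_translations`, the Gaussian height prior (mean 1.7 m), the noise scales, and the single unknown scale factor on the monocular depth values are all assumptions made for this example. Under those assumptions the negative log-posterior is a linear least-squares problem in the per-person depths and the shared depth scale.

```python
import numpy as np

def solve_translations(pixel_heights, rel_depths, focal,
                       mu_h=1.7, sigma_h=0.1, sigma_d=0.3):
    """Toy joint MAP estimate of per-person metric depth (hypothetical sketch).

    Assumed model (not taken from the paper):
      height prior : pixel_height_i * z_i / focal ~ N(mu_h, sigma_h^2)
      depth cue    : z_i - s * rel_depth_i        ~ N(0, sigma_d^2)
    Unknowns are the depths z_1..z_n and one shared depth scale s.
    """
    h = np.asarray(pixel_heights, dtype=float)
    d = np.asarray(rel_depths, dtype=float)
    n = h.size
    idx = np.arange(n)

    # Stack both groups of residuals into one linear system A @ x = b,
    # with x = [z_1, ..., z_n, s].
    A = np.zeros((2 * n, n + 1))
    b = np.zeros(2 * n)

    # Height-prior rows: (h_i / focal) * z_i should equal mu_h.
    A[idx, idx] = (h / focal) / sigma_h
    b[:n] = mu_h / sigma_h

    # Depth-cue rows: z_i - s * d_i should equal 0.
    A[n + idx, idx] = 1.0 / sigma_d
    A[n + idx, n] = -d / sigma_d

    # Least squares minimizes the summed squared residuals, which is the
    # negative log-posterior under the Gaussian assumptions above.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:n], x[n]
```

Because every person shares the single scale `s`, the depth cue ties the subjects together: a person whose pixel height alone would suggest an implausible depth is pulled toward a placement consistent with the rest of the scene.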
Problem

Research questions and friction points this paper is trying to address.

Addressing scene-level inconsistency in multi-person mesh recovery from single images
Resolving conflicting depths and scales among individuals within crowded scenes
Overcoming scarcity of in-the-wild training data for multi-person mesh reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

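Jointly optimizes the camera-space translations of all individuals in a crowd
Uses anthropometric height priors and monocular depth cues within a MAP framework
Enforces plausible relative scales through a relative metric loss in Metric-Aware HMR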
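
The exact form of the relative metric loss is not given on this page. As a hypothetical stand-in, one simple way to penalize implausible relative scales is to compare pairwise log-ratios of a positive per-person quantity (for example, depth) between prediction and pseudo-ground truth. The function name `pairwise_log_ratio_loss` and the squared-error form are assumptions for illustration only.

```python
import numpy as np

def pairwise_log_ratio_loss(pred, target):
    """Hypothetical relative loss over all ordered pairs of people.

    Compares log(pred_i / pred_j) with log(target_i / target_j).
    Inputs must be strictly positive. This is an illustrative stand-in,
    not the loss defined in the paper.
    """
    log_pred = np.log(np.asarray(pred, dtype=float))
    log_tgt = np.log(np.asarray(target, dtype=float))
    n = log_pred.size
    if n < 2:
        return 0.0  # no pairs, so no relative constraint

    # A difference of logs is the log of a ratio, for every pair (i, j).
    ratio_pred = log_pred[:, None] - log_pred[None, :]
    ratio_tgt = log_tgt[:, None] - log_tgt[None, :]

    # Exclude the diagonal (i == j), which is always zero.
    off_diag = ~np.eye(n, dtype=bool)
    return float(np.mean((ratio_pred - ratio_tgt)[off_diag] ** 2))
```

A loss of this kind is unchanged when all predictions are multiplied by the same factor, so it constrains only how people relate to each other; absolute metric scale would have to come from a separate term.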
Authors

Kaiwen Wang
Department of Electronic Engineering, Tsinghua University
Kaili Zheng
Department of Electronic Engineering, Tsinghua University
Yiming Shi
University of Electronic Science and Technology of China
Efficient AI, Parameter Efficient Fine Tuning, Diffusion, Multimodal
Chenyi Guo
Department of Electronic Engineering, Tsinghua University
Ji Wu
Tsinghua University
Artificial Intelligence, smart healthcare, machine learning, pattern recognition, speech recognition