CondiMen: Conditional Multi-Person Mesh Recovery

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multi-person 3D mesh recovery from a single image, jointly solving human detection, 3D pose and shape estimation, camera intrinsic calibration, and depth prediction. To tackle the task's inherent uncertainties, including scale-depth ambiguity, the loss of information in 2D projection, and correlations between people, the authors bring Bayesian generative modeling to this task. They propose a conditional variational autoencoder (CVAE)-based framework that models the joint posterior distribution over SMPL parameters, camera intrinsics, and per-person depth. The method supports test-time integration of body shape priors and multi-view constraints, quantifies uncertainty for both pose and shape, and allows efficient extraction of the most likely prediction, making it suitable for real-time use. Across multiple benchmarks it matches or surpasses the state of the art while improving robustness.
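The scale-depth ambiguity mentioned above can be made concrete with a toy sketch. Under a pinhole camera, a person's image height is h = f·H/d (H the true height, d the distance to the camera), so a taller, farther person and a shorter, nearer one produce the same observation. This is an illustrative sketch with assumed numbers, not the paper's model:

```python
import random

# Pinhole projection: image height h = f * H / d  (H = true height, d = depth).
# Fix one observation and show that many (H, d) pairs explain it equally well.
f = 1000.0   # focal length in pixels (assumed value)
h = 400.0    # observed image height in pixels (assumed value)

random.seed(0)
samples = []
for _ in range(5000):
    H = random.gauss(1.70, 0.10)   # rough adult-height prior in metres (assumed)
    d = f * H / h                  # depth that exactly reproduces the observation
    samples.append((H, d))

depths = [d for _, d in samples]
print(f"depth spread: {min(depths):.2f} m .. {max(depths):.2f} m")
# Every (H, d) sample reproduces the same image evidence:
assert all(abs(f * H / d - h) < 1e-9 for H, d in samples)
```

A point estimate must pick one (H, d) pair from this family; a joint distribution, as proposed here, keeps the whole correlated set of explanations.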

📝 Abstract
Multi-person human mesh recovery (HMR) consists of detecting all individuals in a given input image and predicting the body shape, pose, and 3D location of each detected person. The dominant approaches to this task rely on neural networks trained to output a single prediction for each detected individual. In contrast, we propose CondiMen, a method that outputs a joint parametric distribution over likely poses, body shapes, intrinsics, and distances to the camera, using a Bayesian network. This approach offers several advantages. First, a probability distribution can handle some inherent ambiguities of this task -- such as the uncertainty between a person's size and their distance to the camera, or simply the loss of information when projecting 3D data onto the 2D image plane. Second, the output distribution can be combined with additional information to produce better predictions, e.g. by using known camera or body shape parameters, or by exploiting multi-view observations. Third, one can efficiently extract the most likely predictions from the output distribution, making our proposed approach suitable for real-time applications. Empirically, we find that our model i) achieves performance on par with or better than the state of the art, ii) captures uncertainties and correlations inherent in pose estimation, and iii) can exploit additional information at test time, such as multi-view consistency or body shape priors. CondiMen spices up the modeling of ambiguity, using just the right ingredients on hand.
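The abstract's second advantage, combining the output distribution with known body shape parameters, reduces in the simplest case to conditioning a joint Gaussian. A minimal 1D sketch with made-up numbers (the paper's actual joint distribution is far richer than this):

```python
import math

# Toy joint Gaussian over one pose coordinate (p) and one shape coordinate (s),
# standing in for the model's joint output. All numbers are illustrative.
mu_p, mu_s = 0.2, 0.0      # marginal means
sig_p, sig_s = 0.30, 0.15  # marginal standard deviations
rho = 0.6                  # pose-shape correlation captured by the model

# Known body shape supplied at test time (e.g. a measured subject).
s_known = 0.25

# Standard Gaussian conditioning: p | s = s_known is again Gaussian.
mu_p_cond = mu_p + rho * (sig_p / sig_s) * (s_known - mu_s)
sig_p_cond = sig_p * math.sqrt(1.0 - rho ** 2)

print(f"pose estimate: {mu_p:.3f} +/- {sig_p:.3f}  (no shape info)")
print(f"pose estimate: {mu_p_cond:.3f} +/- {sig_p_cond:.3f}  (shape known)")
assert sig_p_cond < sig_p  # conditioning a Gaussian never increases variance
```

Because the model outputs a parametric joint distribution rather than a point estimate, this kind of test-time refinement comes essentially for free.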
Problem

Research questions and friction points this paper is trying to address.

Estimating 3D human poses and shapes from 2D images with ambiguity handling
Modeling joint probability distributions for multi-person mesh recovery
Enhancing predictions using additional data like camera parameters or multi-view inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian network that outputs a joint parametric distribution over poses, body shapes, camera intrinsics, and distances to the camera
Handles inherent ambiguities (e.g. person size vs. depth) by predicting a distribution rather than a point estimate
Exploits additional test-time information such as multi-view consistency and body shape priors
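In the simplest Gaussian case, the multi-view exploitation listed above amounts to a precision-weighted product of per-view estimates. A 1D sketch with hypothetical numbers, not the paper's actual formulation:

```python
# Fusing two per-view Gaussian estimates of the same quantity (e.g. a person's
# depth) via the product of Gaussians: precision-weighted averaging.

def fuse(mu1, var1, mu2, var2):
    """Product of two 1D Gaussians, renormalized (standard result)."""
    prec = 1.0 / var1 + 1.0 / var2
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

# View A sees the person at ~4.0 m, view B at ~3.6 m (assumed measurements).
mu, var = fuse(4.0, 0.25, 3.6, 0.16)
print(f"fused depth: {mu:.3f} m, variance {var:.4f}")
assert var < min(0.25, 0.16)   # fusion is always more certain than either view
assert 3.6 < mu < 4.0          # fused mean lies between the two views
```

The fused mean leans toward the more confident view (smaller variance), which is exactly the behavior a joint parametric output distribution enables at test time.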