🤖 AI Summary
Existing 3D human pose and shape (HPS) estimation methods generalize poorly to diverse real-world scenes because in-the-wild motion-capture data is scarce and synthetic CG-rendered data is limited in diversity, particularly in human identities and backgrounds. To address this, the authors propose HumanWild, a framework that jointly generates human images and corresponding SMPL-X 3D ground-truth annotations, conditioned on text prompts and surface normal maps. Built on diffusion models with a customized ControlNet and refined with SAM-based segmentation to filter noisy labels, HumanWild synthesizes a dataset of 0.79M samples covering versatile viewpoints, scenes, and identities. Evaluated on five benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D), HPS regressors trained on this data show that generative data is complementary to traditional CG data, achieving strong generalization without relying on real-world in-the-wild 3D annotations.
📝 Abstract
In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape (HPS) estimation. Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. We then train a customized ControlNet model on this dataset to generate diverse human images and initial ground-truth labels. The key to this step is that surface normal images can be obtained easily and at scale from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. Because the initial labels inevitably contain noise, we apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter out noisy data samples. Our data generation pipeline is flexible and customizable, facilitating different real-world tasks, e.g., ego-centric scenes and scenes with perspective distortion. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify its effectiveness. By exclusively employing generative models, we generate large-scale in-the-wild human images with high-quality annotations, eliminating the need for real-world data collection.
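The SAM-based filtering step described in the abstract can be pictured as a mask-agreement check: render the SMPL-X ground-truth mesh to a silhouette, obtain a person mask from SAM on the generated image, and keep the sample only if the two masks agree. The sketch below is an illustrative NumPy implementation of that idea, not the paper's code; obtaining `rendered_mask` and `sam_mask` is assumed to be handled by the renderer and SAM respectively, and `iou_threshold` is a hypothetical parameter.

```python
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def keep_sample(rendered_mask: np.ndarray,
                sam_mask: np.ndarray,
                iou_threshold: float = 0.8) -> bool:
    """Keep a generated sample only if the SAM person mask agrees
    with the silhouette rendered from the SMPL-X ground truth.

    rendered_mask, sam_mask: boolean arrays of the same HxW shape.
    iou_threshold: hypothetical cutoff; the paper does not specify one.
    """
    return mask_iou(rendered_mask, sam_mask) >= iou_threshold
```

Samples where the generated person drifts away from the conditioning normal map will produce a low IoU between the two masks and be discarded, which is one simple way to suppress label noise in the generated pairs.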