Reconstructing People, Places, and Cameras

📅 2024-12-23
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the joint reconstruction of multi-person 3D meshes, scene point clouds, and camera poses from sparse, uncalibrated multi-view images, all within a unified metric world coordinate system that explicitly models spatial relationships among humans, environment, and cameras. Methodologically, it is the first to embed the SMPL human statistical model into a Structure-from-Motion (SfM) framework, leveraging human priors to impose absolute scale constraints. A multi-module joint optimization scheme is introduced to co-estimate human meshes, scene geometry, and camera parameters, synergistically integrating data-driven reconstruction with classical SfM principles. Evaluated on EgoHumans and EgoExo4D, the method reduces world-coordinate human localization error to 1.04 m and 0.56 m, respectively, and improves camera pose accuracy (RRA@15) by 20.3%, significantly enhancing overall geometric fidelity and cross-modal consistency.
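The summary's central idea, using a human statistical body model to impose absolute scale on an otherwise scale-ambiguous reconstruction, can be sketched in a few lines. Everything below (the function name, the 1.7 m prior height, the toy joint coordinates) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

# Illustrative sketch (not the paper's code): rescale an up-to-scale
# reconstruction so a reconstructed person's height matches a
# statistical human-height prior, yielding approximate metric units.

PRIOR_HEIGHT_M = 1.7  # assumed mean height from a statistical body model

def metric_scale_from_human(joints):
    """Scale factor mapping arbitrary units to metres, from the
    person's vertical extent (head minus ankle along the up axis)."""
    height_arbitrary = joints[:, 1].max() - joints[:, 1].min()
    return PRIOR_HEIGHT_M / height_arbitrary

# Toy standing person reconstructed at an arbitrary scale
joints = np.array([[0.0, 0.0, 0.0],   # ankle
                   [0.0, 1.0, 0.0],   # hip
                   [0.0, 2.0, 0.0]])  # head

s = metric_scale_from_human(joints)  # 1.7 / 2.0 = 0.85
scene_points_metric = s * np.array([[4.0, 0.0, 2.0]])
```

The same factor `s` would be applied to the scene point cloud and the camera translations so that the whole reconstruction lands in one approximately metric world frame.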

📝 Abstract
We present "Humans and Structure from Motion" (HSfM), a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system from a sparse set of uncalibrated multi-view images featuring people. Our approach combines data-driven scene reconstruction with the traditional Structure-from-Motion (SfM) framework to achieve more accurate scene reconstruction and camera estimation, while simultaneously recovering human meshes. In contrast to existing scene reconstruction and SfM methods that lack metric scale information, our method estimates approximate metric scale by leveraging a human statistical model. Furthermore, it reconstructs multiple human meshes within the same world coordinate system alongside the scene point cloud, effectively capturing spatial relationships among individuals and their positions in the environment. We initialize the reconstruction of humans, scenes, and cameras using robust foundational models and jointly optimize these elements. This joint optimization synergistically improves the accuracy of each component. We compare our method to existing approaches on two challenging benchmarks, EgoHumans and EgoExo4D, demonstrating significant improvements in human localization accuracy within the world coordinate frame (reducing error from 3.51m to 1.04m in EgoHumans and from 2.9m to 0.56m in EgoExo4D). Notably, our results show that incorporating human data into the SfM pipeline improves camera pose estimation (e.g., increasing RRA@15 by 20.3% on EgoHumans). Additionally, qualitative results show that our approach improves overall scene reconstruction quality. Our code is available at: https://github.com/hongsukchoi/HSfM_RELEASE
Problem

Research questions and friction points this paper is trying to address.

Jointly reconstructing human meshes, scene point clouds, and camera parameters from uncalibrated multi-view images.
Estimating metric scale with a human statistical model for accurate scene and camera reconstruction.
Improving human localization accuracy and camera pose estimation in a world coordinate system.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines data-driven reconstruction with classical SfM for joint reconstruction
Estimates metric scale using a human statistical model
Jointly optimizes humans, scenes, and cameras
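The joint-optimization idea above can be illustrated with a minimal toy: refine a camera translation and a scene point together by minimizing reprojection error, with the human's joints held at metric scale (as a body-model prior would fix them). The camera model (identity rotation, unit focal length), the solver, and all names and values below are assumptions for illustration, not the paper's pipeline:

```python
import numpy as np

# Toy joint refinement (assumed setup, not the paper's pipeline):
# co-estimate the second camera's translation t2 and a scene point Y
# from 2D observations, anchored by metric-scale human joints.

def project(X, t):
    """Pinhole projection with identity rotation and unit focal length."""
    p = X - t
    return p[:2] / p[2]

# Human joints at metric scale (held fixed), first camera at the origin
hip, head = np.array([0.0, 1.0, 5.0]), np.array([0.0, 1.8, 5.0])
t1 = np.zeros(3)

# Ground-truth second camera and scene point, used only to synthesize
# noise-free 2D observations
t2_gt = np.array([1.0, 0.0, 0.0])
Y_gt = np.array([2.0, 0.5, 6.0])
obs = [project(hip, t2_gt), project(head, t2_gt),
       project(Y_gt, t1), project(Y_gt, t2_gt)]

def residuals(params):
    """Stacked reprojection errors for the unknowns (t2, Y)."""
    t2, Y = params[:3], params[3:]
    preds = [project(hip, t2), project(head, t2),
             project(Y, t1), project(Y, t2)]
    return np.concatenate([p - o for p, o in zip(preds, obs)])

def num_jac(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x."""
    J = np.zeros((f(x).size, x.size))
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

# Damped Gauss-Newton refinement from a perturbed initialization
params = np.concatenate([t2_gt + 0.2, Y_gt - 0.2])
for _ in range(20):
    r, J = residuals(params), num_jac(residuals, params)
    params -= np.linalg.solve(J.T @ J + 1e-9 * np.eye(6), J.T @ r)

t2_est, Y_est = params[:3], params[3:]
```

Without the fixed-height human joints, `t2` and `Y` would only be recoverable up to an overall scale; the metric human observations are what pin the camera and scene estimates to absolute units, which is the mechanism this paper's bullets describe.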