HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Joint 3D reconstruction of humans and scenes from sparse, uncalibrated multi-view images remains challenging: existing methods primarily target static outdoor scenes and exhibit limited geometric fidelity in complex, dynamic human activity scenarios. Method: We propose the first end-to-end feedforward network for unified human-scene geometric modeling. Our approach builds upon the DUNE image encoder, integrating MASt3R's scene reconstruction capability with multi-HMR's human pose prior. It introduces semantic-aware dense point cloud generation and jointly embeds person segmentation, DensePose-based cross-view correspondence, and multi-head depth prediction, eliminating reliance on post-hoc optimization. Results: Our method achieves state-of-the-art performance on the EgoHumans and EgoExo4D benchmarks. Moreover, it demonstrates strong generalization to conventional multi-view stereo (MVS) and human pose regression tasks, validating its robustness and versatility across diverse 3D vision domains.

📝 Abstract
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing 3D scenes from sparse uncalibrated images
Handling human-centric scenarios in 3D reconstruction
Joint human and scene reconstruction from multi-view images
Innovation

Methods, ideas, or system contributions that make the work stand out.

HAMSt3R extends MASt3R for joint human-scene reconstruction
Uses DUNE encoder combining scene geometry and human understanding
Incorporates segmentation, DensePose, and depth prediction heads
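The shared-encoder, multi-head pattern listed above can be sketched in a few lines of plain Python. This is a purely illustrative toy, not the paper's implementation: the encoder stands in for DUNE, the heads use trivial hand-written rules instead of learned weights, and all function names and feature shapes are assumptions for the sake of the example.

```python
# Illustrative sketch of a shared encoder feeding task-specific heads,
# as in HAMSt3R's design (segmentation, correspondence, depth).
# All logic here is a toy stand-in for learned networks.

def shared_encoder(image):
    """Stand-in for the DUNE encoder: map a 2D grid of pixel
    intensities to a per-pixel feature vector (here, 2 channels)."""
    return [[(pix, pix * 0.5) for pix in row] for row in image]

def segmentation_head(features):
    """Toy person-segmentation head: threshold the first channel
    to produce a binary person mask."""
    return [[1 if f[0] > 0.5 else 0 for f in row] for row in features]

def depth_head(features):
    """Toy depth head: map the second channel to a positive depth."""
    return [[1.0 / (f[1] + 1.0) for f in row] for row in features]

def reconstruct(image):
    """Run the shared encoder once, then each head independently,
    mirroring the fully feed-forward (no post-hoc optimization) design."""
    feats = shared_encoder(image)
    return {
        "segmentation": segmentation_head(feats),
        "depth": depth_head(feats),
    }

outputs = reconstruct([[0.2, 0.8], [0.9, 0.1]])
```

The point of the pattern is that all heads read the same feature map, so adding a new task (e.g. a DensePose-style correspondence head) costs only one extra function over shared features, not a second encoder pass.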
Sara Rojas
PhD Candidate at KAUST
3D Computer Vision

Matthieu Armando
NAVER LABS Europe

Bernard Ghanem
KAUST

Philippe Weinzaepfel
Principal Research Scientist, NAVER LABS Europe
Computer Vision, Deep Learning

Vincent Leroy
NAVER LABS Europe

Gregory Rogez
NAVER LABS Europe