UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the domain shift, geometric distortion, and misalignment that arise when jointly reconstructing real-world scenes and 3D humans with models trained primarily on synthetic data. The authors propose an end-to-end feed-forward framework that simultaneously recovers metric-scale scene geometry, human point clouds, camera parameters, and SMPL body models in a single forward pass. Leveraging unlabeled real-world videos, the method integrates scene reconstruction with human mesh recovery (HMR) priors through a two-stage training strategy, coarse localization on synthetic data followed by geometric refinement on real data, and employs a high-frequency detail distillation scheme to bridge the sim-to-real domain gap. Experiments show that the approach achieves state-of-the-art performance on human-centric scene reconstruction and compares favorably against both optimization-based and pure HMR methods on global human motion estimation.
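The paper does not spell out the form of the high-frequency distillation objective, but a common way to realize "distilling high-frequency details from an expert depth model" is to match a high-pass-filtered version of the student's depth prediction to that of the expert, so only fine surface detail (not the expert's possibly non-metric absolute scale) is transferred. The sketch below is an illustrative assumption using a discrete Laplacian as the high-pass filter; the function names and loss form are hypothetical, not taken from the paper.

```python
import numpy as np

def laplacian(depth: np.ndarray) -> np.ndarray:
    """Discrete 4-neighbour Laplacian: a simple high-pass filter on a depth map."""
    return (
        -4.0 * depth
        + np.roll(depth, 1, axis=0) + np.roll(depth, -1, axis=0)
        + np.roll(depth, 1, axis=1) + np.roll(depth, -1, axis=1)
    )

def hf_distillation_loss(pred_depth: np.ndarray, expert_depth: np.ndarray) -> float:
    """L1 distance between the high-frequency components of the student's
    predicted depth and the expert model's depth. Low-frequency (scale/offset)
    disagreement is largely cancelled by the Laplacian."""
    return float(np.mean(np.abs(laplacian(pred_depth) - laplacian(expert_depth))))
```

Because the Laplacian removes constant and near-planar components, the student is free to keep its own metric scale while inheriting the expert's fine geometry, which is one plausible reading of why the distillation is restricted to high frequencies.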

📝 Abstract
We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and human mesh recovery (HMR), and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods. Project page: https://murphylmf.github.io/UniSH/
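The abstract's second component, "directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud," is stated without a formula. A standard way to score such correspondence is a symmetric chamfer distance between the SMPL vertices and the reconstructed human point cloud; the minimal sketch below assumes that form purely for illustration (the paper's actual loss may differ), and the function name is hypothetical.

```python
import numpy as np

def chamfer_loss(smpl_verts: np.ndarray, human_points: np.ndarray) -> float:
    """Symmetric chamfer distance between SMPL vertices (V, 3) and a human
    point cloud (P, 3): mean nearest-neighbour distance in both directions.
    Driving this to zero pulls the body mesh onto the reconstructed points."""
    # Pairwise squared distances, shape (V, P).
    d2 = np.sum((smpl_verts[:, None, :] - human_points[None, :, :]) ** 2, axis=-1)
    mesh_to_cloud = np.sqrt(d2.min(axis=1)).mean()  # each vertex to its nearest point
    cloud_to_mesh = np.sqrt(d2.min(axis=0)).mean()  # each point to its nearest vertex
    return float(mesh_to_cloud + cloud_to_mesh)
```

Because both the mesh and the point cloud are predicted in the same metric camera frame, a loss of this kind needs no real-world SMPL annotations, which is consistent with the abstract's claim of fine-tuning on unlabeled in-the-wild data.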
Problem

Research questions and friction points this paper is trying to address.

3D reconstruction
sim-to-real gap
human-scene reconstruction
domain generalization
metric-scale recovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

feed-forward reconstruction
sim-to-real domain gap
knowledge distillation
two-stage supervision
metric-scale 3D human-scene reconstruction