🤖 AI Summary
Reconstructing the 3D poses and shapes of hundreds of humans from a single large-scale scene image is hampered by scale variation, depth ambiguity, and perspective distortion, which cause spatial inconsistency and reprojection errors. To address these challenges, this paper proposes the Human-scene Virtual Interaction Point (HVIP) paradigm. HVIP establishes canonical upright 3D and 2D spaces with corresponding normalization mechanisms, enabling reconstruction that requires no test-time optimization and generalizes across diverse fields of view (FoVs). The method integrates iterative ground-aware cropping, HVIP-driven 2D localization, multi-scale feature fusion, and an end-to-end differentiable reconstruction network. Evaluated on the newly introduced LargeCrowd and SynCrowd benchmarks, the approach significantly outperforms state-of-the-art methods and is the first to achieve crowd-level 3D human reconstruction with global geometric consistency, accurate reprojection, and robust cross-FoV stability.
📝 Abstract
This paper focuses on the spatially consistent reconstruction of hundreds of human poses and shapes from a single large-scene image, where human scales vary widely and the camera FoV (Field of View) is arbitrary. Due to the small and highly varying 2D human scales, depth ambiguity, and perspective distortion, no existing method achieves globally consistent reconstruction with correct reprojection. To address these challenges, we first propose a new concept, the Human-scene Virtual Interaction Point (HVIP), which converts the complex problem of 3D human localization into 2D pixel localization. We then extend it to RCR (Robust Crowd Reconstruction), which achieves globally consistent reconstruction and generalizes stably across different camera FoVs without test-time optimization. To perceive humans of varying pixel sizes, we propose Iterative Ground-aware Cropping, which automatically crops the image and then merges the per-crop results. To eliminate the influence of the camera and the cropping process during reconstruction, we introduce a canonical Upright 3D Space and a corresponding Upright 2D Space. To link the canonical spaces and the camera space, we propose Upright Normalization, which transforms each local crop into the Upright 2D Space and transforms the output from the Upright 3D Space into the unified camera space. In addition, we contribute two benchmark datasets, LargeCrowd and SynCrowd, for evaluating crowd reconstruction in large scenes. Experimental results demonstrate the effectiveness of the proposed method. The source code and data will be publicly available for research purposes.
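The geometric intuition behind converting 3D human localization into 2D pixel localization can be illustrated with a standard ray–plane intersection: once an interaction point between a person and the scene (e.g., a ground-contact point) is localized in the image, its 3D position follows from the camera intrinsics and the ground plane. The sketch below is a minimal illustration of that general idea under simplifying assumptions (a known pinhole intrinsic matrix `K` and a known ground plane); the function name and the exact formulation are ours, not the paper's actual HVIP method.

```python
import numpy as np

def backproject_to_ground(K, uv, plane_n, plane_d):
    """Intersect the camera ray through pixel `uv` with the ground plane.

    Camera frame: x right, y down, z forward (OpenCV convention).
    The ground plane is the set of points X with plane_n @ X + plane_d = 0.
    Returns the 3D intersection point in the camera frame.
    """
    # Back-project the pixel into a ray direction in the camera frame.
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    denom = plane_n @ ray
    if abs(denom) < 1e-9:
        raise ValueError("Ray is (nearly) parallel to the ground plane")
    t = -plane_d / denom  # distance along the ray to reach the plane
    return t * ray

# Example: camera 1.7 m above a flat ground (y-down camera frame),
# focal length 1000 px, principal point at (500, 500).
K = np.array([[1000.0,    0.0, 500.0],
              [   0.0, 1000.0, 500.0],
              [   0.0,    0.0,   1.0]])
plane_n = np.array([0.0, 1.0, 0.0])  # plane: y = 1.7, i.e. n @ X + d = 0
plane_d = -1.7
point = backproject_to_ground(K, (500.0, 700.0), plane_n, plane_d)
print(point)  # a ground point 8.5 m in front of the camera
```

Note that a single 2D point plus the ground-plane constraint fully determines the 3D location, which is why pinning humans to such interaction points sidesteps per-person depth regression; the paper's contribution lies in making this robust when the contact point is occluded or the person is not touching the ground.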