🤖 AI Summary
This work addresses pose-free, calibration-free 3D human reconstruction from casually captured single- or multi-view images. We propose the first end-to-end, pose-free multi-view fusion framework enabling high-fidelity, animatable 3D human reconstruction within seconds. Methodologically, we design a hierarchical Point-Image Transformer that jointly encodes point-cloud and image features; introduce a multimodal attention mechanism; adopt 3D Gaussian splats as a unified geometry and appearance representation; and enforce unsupervised geometric consistency to decouple the reconstruction stages. Our approach unifies modeling for both single- and multi-view inputs, without requiring pose priors, camera parameters, or manual annotations, while supporting skinning, rigging, and animation-driven deformation. Extensive experiments on real and synthetic benchmarks demonstrate significant improvements over state-of-the-art methods in reconstruction fidelity, inference speed, and practical applicability.
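For background on the 3D Gaussian splat representation mentioned above: each splat is conventionally parameterized by a center, an orientation quaternion, per-axis scales, an opacity, and a color, with its anisotropic covariance derived from the rotation and scales. The sketch below follows the standard Gaussian-splatting convention and is not code from this paper; all names are illustrative.

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_covariance(quat, scale):
    """Covariance of one anisotropic 3D Gaussian: Sigma = R S S^T R^T,
    where R comes from the quaternion and S = diag(scale)."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T
```

With the identity quaternion `(1, 0, 0, 0)` and scales `(1, 2, 3)`, the covariance is simply `diag(1, 4, 9)`; a nontrivial rotation permutes the principal axes accordingly.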
📝 Abstract
Reconstructing an animatable 3D human from casually captured images of an articulated subject, without camera or human pose information, is a practical yet challenging task due to view misalignment, occlusions, and the absence of structural priors. While optimization-based methods can produce high-fidelity results from monocular or multi-view videos, they require accurate pose estimation and slow iterative optimization, limiting scalability in unconstrained scenarios. Recent feed-forward approaches enable efficient single-image reconstruction but struggle to effectively leverage multiple input images to reduce ambiguity and improve reconstruction accuracy. To address these challenges, we propose PF-LHM, a large human reconstruction model that generates high-quality 3D avatars in seconds from one or multiple casually captured pose-free images. Our approach introduces an efficient Encoder-Decoder Point-Image Transformer architecture, which fuses hierarchical geometric point features and multi-view image features through multimodal attention. The fused features are decoded to recover detailed geometry and appearance, represented as 3D Gaussian splats. Extensive experiments on both real and synthetic datasets demonstrate that our method unifies single- and multi-image 3D human reconstruction, achieving high-fidelity, animatable 3D human avatars without requiring camera or human pose annotations. Code and models will be released to the public.
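The multimodal attention described in the abstract can be sketched, at its simplest, as scaled dot-product cross-attention in which point tokens query flattened multi-view image tokens. This is a generic illustration of the mechanism, not the paper's architecture; the dimensions and random projection weights below are placeholder assumptions standing in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def point_image_cross_attention(point_feats, image_feats, d_k=32, seed=0):
    """Point tokens (queries) attend over multi-view image tokens
    (keys/values); projections are random placeholders for learned weights."""
    rng = np.random.default_rng(seed)
    d_p, d_i = point_feats.shape[-1], image_feats.shape[-1]
    Wq = rng.standard_normal((d_p, d_k)) / np.sqrt(d_p)
    Wk = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    Wv = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    Q, K, V = point_feats @ Wq, image_feats @ Wk, image_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (num_points, num_image_tokens)
    return attn @ V                         # (num_points, d_k) fused features
```

Because every point token attends over tokens from all views at once, the fusion is naturally order- and count-agnostic in the number of input images, which is consistent with the unified single- and multi-image setting described above.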