🤖 AI Summary
This work addresses pose-free, calibration-free 3D human reconstruction from casually captured single- or multi-view images. We propose the first end-to-end, pose-free multi-view fusion framework enabling high-fidelity, animatable 3D human reconstruction within seconds. Methodologically, we design a hierarchical Point-Image Transformer that jointly encodes point-cloud and image features; introduce a multimodal attention mechanism; adopt 3D Gaussian splats as a unified geometry and appearance representation; and enforce unsupervised geometric consistency to decouple the reconstruction stages. Our approach unifies modeling for both single- and multi-view inputs, without requiring pose priors, camera parameters, or manual annotations, while supporting skinning, rigging, and animation-driven deformation. Extensive experiments on real and synthetic benchmarks demonstrate significant improvements over state-of-the-art methods in reconstruction fidelity, inference speed, and practical applicability.
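For background on the 3D Gaussian splat representation mentioned above: each splat is conventionally parameterized by a center, an orientation quaternion, per-axis scales, an opacity, and a color, with its anisotropic covariance derived from the rotation and scales. The sketch below follows the standard Gaussian-splatting convention and is not code from this paper; all names are illustrative.

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_covariance(quat, scale):
    """Covariance of one anisotropic 3D Gaussian: Sigma = R S S^T R^T,
    where R comes from the quaternion and S = diag(scale)."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T
```

With the identity quaternion `(1, 0, 0, 0)` and scales `(1, 2, 3)`, the covariance is simply `diag(1, 4, 9)`; a nontrivial rotation permutes the principal axes accordingly.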
📝 Abstract
Reconstructing an animatable 3D human from casually captured images of an articulated subject, without camera or human pose information, is a practical yet challenging task due to view misalignment, occlusions, and the absence of structural priors. While optimization-based methods can produce high-fidelity results from monocular or multi-view videos, they require accurate pose estimation and slow iterative optimization, limiting scalability in unconstrained scenarios. Recent feed-forward approaches enable efficient single-image reconstruction but struggle to effectively leverage multiple input images to reduce ambiguity and improve reconstruction accuracy. To address these challenges, we propose PF-LHM, a large human reconstruction model that generates high-quality 3D avatars in seconds from one or multiple casually captured pose-free images. Our approach introduces an efficient Encoder-Decoder Point-Image Transformer architecture, which fuses hierarchical geometric point features and multi-view image features through multimodal attention. The fused features are decoded to recover detailed geometry and appearance, represented as 3D Gaussian splats. Extensive experiments on both real and synthetic datasets demonstrate that our method unifies single- and multi-image 3D human reconstruction, achieving high-fidelity, animatable 3D human avatars without requiring camera or human pose annotations. Code and models will be released to the public.
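The multimodal attention described in the abstract can be sketched, at its simplest, as scaled dot-product cross-attention in which point tokens query flattened multi-view image tokens. This is a generic illustration of the mechanism, not the paper's architecture; the dimensions and random projection weights below are placeholder assumptions standing in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def point_image_cross_attention(point_feats, image_feats, d_k=32, seed=0):
    """Point tokens (queries) attend over multi-view image tokens
    (keys/values); projections are random placeholders for learned weights."""
    rng = np.random.default_rng(seed)
    d_p, d_i = point_feats.shape[-1], image_feats.shape[-1]
    Wq = rng.standard_normal((d_p, d_k)) / np.sqrt(d_p)
    Wk = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    Wv = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    Q, K, V = point_feats @ Wq, image_feats @ Wk, image_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (num_points, num_image_tokens)
    return attn @ V                         # (num_points, d_k) fused features
```

Because every point token attends over tokens from all views at once, the fusion is naturally order- and count-agnostic in the number of input images, which is consistent with the unified single- and multi-image setting described above.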