LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Single-image animatable 3D human reconstruction suffers from ambiguities in disentangling geometry, appearance, and deformation, as well as limited generalization across subjects and poses. This paper introduces the first feed-forward large model tailored for drivable human reconstruction, generating geometry, texture, and skinning weights in a 3D Gaussian Splatting representation directly from a single image within seconds—without post-processing. Key innovations include a head-region feature pyramid encoder, which significantly improves facial identity preservation and fine-detail fidelity, and a multimodal Transformer architecture that jointly encodes human spatial priors and aggregates cross-scale head-region features. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and cross-domain generalization, outperforming existing static reconstruction and optimization-based methods. The resulting representations support plug-and-play skeletal animation with standard rigging pipelines.

📝 Abstract
Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and the reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost face identity preservation and fine-detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
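The abstract describes a multimodal transformer that jointly encodes body positional features and image features via attention, so each body query can pull in image evidence. A minimal self-attention sketch over the concatenated token sets is shown below; all shapes, variable names, and the single-head formulation are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(body_tokens, image_tokens):
    """Scaled dot-product self-attention over the concatenation of
    body positional tokens and image tokens, so body points attend
    to image features (and vice versa). Illustrative sketch only."""
    tokens = np.concatenate([body_tokens, image_tokens], axis=0)  # (Nb+Ni, D)
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))  # (Nb+Ni, Nb+Ni)
    out = attn @ tokens
    return out[: len(body_tokens)]  # updated body tokens only

body = np.random.rand(16, 32)   # e.g. sampled body-surface point tokens (assumed)
img = np.random.rand(64, 32)    # e.g. image patch tokens (assumed)
updated = joint_attention(body, img)
print(updated.shape)            # (16, 32)
```

In the real model, the updated body tokens would then be decoded into per-point Gaussian parameters (position offsets, color, opacity, skinning weights); this sketch only shows the token-mixing step.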
Problem

Research questions and friction points this paper is trying to address.

Challenges in animatable 3D human reconstruction from single images.
Limitations of static human modeling and synthetic 3D scans.
Need for efficient, high-fidelity avatars with detailed geometry and texture.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Gaussian splatting for high-fidelity avatars.
Employs a multimodal transformer with an attention mechanism.
Implements a head feature pyramid for detail recovery.
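The head feature pyramid aggregates multi-scale features from the head region so that facial identity and fine detail survive the global encoding. A minimal sketch of the underlying idea, assuming a crop-then-pool scheme; the function name, box format, scales, and pooling choice are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def head_feature_pyramid(image_feats, head_box, scales=(1, 2, 4)):
    """Crop the head region from a (H, W, C) feature map, average-pool
    it into s x s grids at several scales, and concatenate everything
    into one multi-scale feature vector. Hypothetical sketch."""
    x0, y0, x1, y1 = head_box
    head = image_feats[y0:y1, x0:x1, :]  # (h, w, C) head crop
    pooled = []
    for s in scales:
        h, w, c = head.shape
        cells = np.zeros((s, s, c))
        for i in range(s):
            for j in range(s):
                # Average-pool one cell of the s x s grid.
                patch = head[i * h // s:(i + 1) * h // s,
                             j * w // s:(j + 1) * w // s, :]
                cells[i, j] = patch.mean(axis=(0, 1))
        pooled.append(cells.reshape(-1))  # flatten s*s*C values
    return np.concatenate(pooled)  # multi-scale head descriptor

feats = np.random.rand(64, 64, 8)  # toy feature map (assumed shape)
token = head_feature_pyramid(feats, head_box=(10, 5, 30, 25))
print(token.shape)  # (1*1 + 2*2 + 4*4) * 8 channels = (168,)
```

The resulting vector could then be injected into the transformer alongside the other tokens; cross-scale aggregation like this is what lets coarse identity cues and fine facial detail coexist in one descriptor.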