IDOL: Instant Photorealistic 3D Human Creation from a Single Image

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work addresses the challenging problem of reconstructing animatable, high-fidelity 3D human avatars from a single input image. It introduces HuGe100K, the first large-scale synthetic dataset specifically designed for human-centric multi-view generation. The method employs a feed-forward Vision Transformer that explicitly disentangles pose, shape deformation, clothing geometry, and texture, integrated with generative multi-view synthesis and 3D Gaussian splatting. The framework enables end-to-end reconstruction with no post-processing: given only one image, it produces a fully drivable 3D Gaussian human model at 1K (1000×1000) resolution instantly on a single GPU. Compared to state-of-the-art approaches, the method achieves significant improvements in reconstruction accuracy, temporal coherence during animation, and flexibility for shape and texture editing. The output representation also integrates directly into downstream tasks such as animation, rendering, and editing.
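The summary above hinges on two ideas: a human is represented as a set of 3D Gaussians predicted in a canonical (uniform) space, and the model is "drivable" because those Gaussians can be reposed without post-processing, typically via skeleton-driven skinning. The sketch below is illustrative only, not the paper's code: the `Gaussian3D` fields follow the standard 3D Gaussian splatting parameterization, and `skin_centers` shows linear blend skinning of Gaussian centers, a common way such representations are animated; all names and the 2-bone example are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gaussian3D:
    # Standard per-Gaussian attributes in a splat-based representation
    center: List[float]    # 3D mean position in canonical (rest-pose) space
    scale: List[float]     # anisotropic extent along each local axis
    rotation: List[float]  # orientation quaternion (w, x, y, z)
    opacity: float         # alpha used during splatting
    color: List[float]     # RGB (view-independent simplification)

def skin_centers(gaussians, bone_transforms, skin_weights):
    """Illustrative linear blend skinning of Gaussian centers: each
    canonical center is moved by a weighted sum of per-bone 4x4
    transforms (row-major), so the avatar can be reposed directly."""
    posed = []
    for g, weights in zip(gaussians, skin_weights):
        x, y, z = g.center
        out = [0.0, 0.0, 0.0]
        for w, T in zip(weights, bone_transforms):
            # apply homogeneous transform T to (x, y, z, 1), scaled by weight w
            for i in range(3):
                out[i] += w * (T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3])
        posed.append(out)
    return posed

# Toy example: one Gaussian skinned half to a fixed bone, half to a
# bone translated +1 along x.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shift_x  = [[1, 0, 0, 1.0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
g = Gaussian3D([0.5, 0.0, 0.0], [1, 1, 1], [1, 0, 0, 0], 1.0, [1, 1, 1])
posed = skin_centers([g], [identity, shift_x], [[0.5, 0.5]])
```

In a full pipeline the rotations and (under nonrigid deformation) scales are transformed as well, and rendering rasterizes the posed Gaussians; the point here is only that animation reduces to applying transforms to predicted parameters, which is why no per-subject post-processing is needed.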

📝 Abstract
Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K diverse, photorealistic sets of human images. Each set contains 24-view frames in specific human poses, generated using a pose-controllable image-to-multi-view model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, body shape, clothing geometry, and texture. The estimated Gaussians can be animated without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the ability to efficiently reconstruct photorealistic humans at 1K resolution from a single input image using a single GPU instantly. Additionally, it seamlessly supports various applications, as well as shape and texture editing tasks.
Problem

Research questions and friction points this paper is trying to address.

Create 3D human avatars from single images
Overcome limited training data diversity
Enable instant photorealistic reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale generated dataset HuGe100K
Feed-forward transformer for 3D Gaussians
Instant photorealistic 1K human reconstruction