GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-guided 3D human generation methods suffer from prohibitively long training times, coarse facial and clothing details, and inconsistent textures across views. To address these issues, we propose a two-stage framework. In Stage I, we introduce Adaptive Human Distillation Sampling (AHDS), integrating 3D Gaussian Splatting with an enhanced Score Distillation Sampling (SDS) scheme to achieve fast, identity-preserving coarse human reconstruction. In Stage II, we design the View-Consistent Refinement (VCR) module, which jointly leverages cross-attention and distance-guided attention to explicitly enforce multi-view geometric and appearance constraints, significantly improving facial and clothing fidelity as well as inter-view texture consistency. Our work pioneers a human-centric distillation mechanism and the VCR strategy, substantially reducing training steps while preserving identity consistency. Extensive experiments demonstrate state-of-the-art performance in visual fidelity, detail realism, and cross-view consistency.

📝 Abstract
Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it iteratively produces detail-enhanced versions of the multi-view images from stage 1, ensuring 3D texture consistency across views via mutual attention and distance-guided attention fusion. A polished version of the 3D human can then be achieved by directly performing reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results. Our code is available at: https://github.com/silence-tang/GaussianIP.
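To make the distillation idea above concrete, here is a minimal NumPy sketch of the classic SDS gradient (weighted difference between predicted and injected noise) together with a hypothetical annealed timestep schedule in the spirit of AHDS, which samples large (coarse) diffusion timesteps early in training and small (detail) timesteps later. The function names, the schedule shape, and the bounds `t_max`/`t_min` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sds_grad(x, eps_hat, eps, w_t):
    # Classic SDS gradient: w(t) * (predicted noise - injected noise),
    # propagated to the rendered image x (identity Jacobian in this toy).
    return w_t * (eps_hat - eps)

def annealed_timestep(step, total_steps, t_max=980, t_min=20):
    # Hypothetical annealing in the spirit of AHDS: early steps draw
    # large timesteps (global structure), late steps draw small ones
    # (fine facial/garment detail). Linear interpolation for clarity.
    frac = step / max(total_steps - 1, 1)
    return int(round(t_max - frac * (t_max - t_min)))
```

A real pipeline would obtain `eps_hat` from a human-centric diffusion model conditioned on the text and image prompts, and apply `sds_grad` to the Gaussians through the differentiable rasterizer.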
Problem

Research questions and friction points this paper is trying to address.

Generating identity-preserving 3D humans from text and image prompts.
Reducing long training times while enhancing facial and garment details.
Improving visual quality through adaptive, human-centric generation techniques.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Human Distillation Sampling for rapid 3D human generation
View-Consistent Refinement enhances face and garment details
Human-centric diffusion prior ensures identity-preserving realistic results
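The distance-guided attention fusion used by the VCR strategy can be sketched as a proximity-weighted blend: each view borrows more attention features from a reference view the closer the two camera angles are. The Gaussian falloff, the `sigma` parameter, and the linear blend below are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def distance_weight(view_angle_deg, ref_angle_deg, sigma=45.0):
    # Hypothetical Gaussian falloff over angular camera distance:
    # nearby views receive a weight close to 1, distant views near 0.
    d = abs(view_angle_deg - ref_angle_deg)
    d = min(d, 360.0 - d)  # wrap-around angular distance
    return np.exp(-(d / sigma) ** 2)

def fuse_features(self_feat, ref_feat, w):
    # Blend a view's own attention features with the reference view's,
    # weighted by angular proximity (w in [0, 1]).
    return (1.0 - w) * self_feat + w * ref_feat
```

In this sketch, views far from the reference keep mostly their own features, which is one simple way to trade off per-view detail against cross-view texture consistency.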
Authors
Zichen Tang — School of Artificial Intelligence, Beihang University, Beijing, China
Yuan Yao
Miaomiao Cui — Alibaba Group
Liefeng Bo — Head of Applied Computer Vision Lab, Alibaba Group
Hongyu Yang — School of Artificial Intelligence, Beihang University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China