SyncHuman: Synchronizing 2D and 3D Generative Models for Single-view Human Reconstruction

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-fidelity 3D dressed human reconstruction from a single image remains highly challenging due to pose ambiguity, severe self-occlusion, and fine-detail loss. To address these issues, we propose SyncHuman—a novel framework that jointly models 2D multi-view generation and native 3D generation for the first time. Our core innovation is a pixel-aligned 2D–3D synchronized attention mechanism, enabling geometrically consistent joint generation of 3D shape and multi-view images. We further introduce 2D detail-guided feature injection and cross-dimensional feature enhancement, significantly improving structural robustness under complex poses and fidelity of surface details. Extensive evaluations on multiple benchmarks demonstrate that SyncHuman outperforms state-of-the-art SMPL-based and pure 3D generative methods in both geometric accuracy (e.g., Chamfer Distance, Pose Vertex Error) and visual realism (e.g., Fréchet Inception Distance, Learned Perceptual Image Patch Similarity). The method achieves high-quality, robust monocular 3D human reconstruction.

📝 Abstract
Photorealistic 3D full-body human reconstruction from a single image is a critical yet challenging task for applications in films and video games due to inherent ambiguities and severe self-occlusions. While recent approaches leverage SMPL estimation and SMPL-conditioned image generative models to hallucinate novel views, they suffer from inaccurate 3D priors estimated from SMPL meshes and struggle to handle difficult human poses and to reconstruct fine details. In this paper, we propose SyncHuman, a novel framework that, for the first time, combines a 2D multiview generative model with a 3D native generative model, enabling high-quality clothed human mesh reconstruction from single-view images even under challenging poses. The multiview generative model excels at capturing fine 2D details but struggles with structural consistency, whereas the 3D native generative model produces coarse yet structurally consistent shapes. By integrating the complementary strengths of these two approaches, we develop a more effective generation framework. Specifically, we first jointly fine-tune the multiview generative model and the 3D native generative model with the proposed pixel-aligned 2D-3D synchronization attention to produce geometrically aligned 3D shapes and 2D multiview images. To further improve details, we introduce a feature injection mechanism that lifts fine details from the 2D multiview images onto the aligned 3D shapes, enabling accurate and high-fidelity reconstruction. Extensive experiments demonstrate that SyncHuman achieves robust and photorealistic 3D human reconstruction, even for images with challenging poses. Our method outperforms baseline methods in geometric accuracy and visual fidelity, pointing to a promising direction for future 3D generative models.
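The abstract gives no implementation details for the pixel-aligned 2D-3D synchronization attention, but the general idea (each 3D token attends to the 2D tokens it projects onto in every generated view) can be sketched roughly as follows. Everything here is an assumption for illustration, not the paper's actual architecture: orthographic cameras at fixed yaw angles, normalized coordinates in [-1, 1], and a single-head attention over the per-view pixel-aligned tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rot_y(deg):
    """Rotation about the vertical axis, used as a toy orthographic camera."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pixel_aligned_sync_attention(vox_feats, vox_xyz, view_feats, angles, H, W):
    """Sketch of pixel-aligned 2D-3D synchronized attention (hypothetical).

    vox_feats : (N, C) features of N 3D tokens (e.g. occupied voxels)
    vox_xyz   : (N, 3) token centers in [-1, 1]^3
    view_feats: (V, H*W, C) feature maps of the V generated views
    angles    : list of V camera yaw angles in degrees
    """
    N, C = vox_feats.shape
    aligned = np.empty((len(angles), N, C))
    for i, a in enumerate(angles):
        p = vox_xyz @ rot_y(a).T  # rotate token centers into the view frame
        # orthographic projection to pixel indices, clipped to the image
        u = np.clip(((p[:, 0] + 1) / 2 * (W - 1)).round().astype(int), 0, W - 1)
        v = np.clip(((1 - (p[:, 1] + 1) / 2) * (H - 1)).round().astype(int), 0, H - 1)
        aligned[i] = view_feats[i].reshape(H, W, C)[v, u]  # 2D token per 3D token
    # each 3D token queries its V pixel-aligned 2D tokens and pools them
    scores = np.einsum('nc,vnc->nv', vox_feats, aligned) / np.sqrt(C)
    weights = softmax(scores, axis=-1)            # (N, V) per-token view weights
    return vox_feats + np.einsum('nv,vnc->nc', weights, aligned)
```

The residual update at the end mirrors how such a layer would typically slot into a transformer block; the real model presumably also propagates information in the 3D-to-2D direction.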
Problem

Research questions and friction points this paper is trying to address.

Reconstructing photorealistic 3D humans from single images
Overcoming inaccurate 3D priors and challenging human poses
Integrating 2D detail generation with 3D structural consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining 2D multiview and 3D native generative models
Using pixel-aligned 2D-3D synchronization attention mechanism
Injecting 2D multiview features onto aligned 3D shapes
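The third idea, lifting 2D multiview features onto the aligned 3D shape, could be sketched as projecting each surface point into every generated view and pooling the pixel-aligned features. The orthographic cameras, fixed yaw angles, and mean pooling below are illustrative assumptions; the paper's actual injection mechanism is not specified on this page.

```python
import numpy as np

def rot_y(deg):
    """Rotation about the vertical axis, used as a toy orthographic camera."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def lift_2d_features(surface_pts, view_feats, angles, H, W):
    """Hypothetical lifting step: each surface point gathers the 2D feature at
    its orthographic projection in every view and averages them.

    surface_pts: (P, 3) surface points of the 3D shape in [-1, 1]^3
    view_feats : (V, H*W, C) feature maps of the V generated views
    angles     : list of V camera yaw angles in degrees
    """
    P, C = len(surface_pts), view_feats.shape[-1]
    lifted = np.zeros((P, C))
    for i, a in enumerate(angles):
        p = surface_pts @ rot_y(a).T  # rotate points into the view frame
        u = np.clip(((p[:, 0] + 1) / 2 * (W - 1)).round().astype(int), 0, W - 1)
        v = np.clip(((1 - (p[:, 1] + 1) / 2) * (H - 1)).round().astype(int), 0, H - 1)
        lifted += view_feats[i].reshape(H, W, C)[v, u]
    return lifted / len(angles)  # (P, C) per-point detail feature
```

A production version would weight views by visibility (e.g. surface normals or depth tests) rather than averaging occluded views uniformly.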
Authors
Wenyue Chen (PKU)
Peng Li (HKUST)
Wangguandong Zheng (SEU)
Chengfeng Zhao (HKUST)
Mengfei Li (HKUST)
Yaolong Zhu (PKU)
Zhiyang Dou (MIT)
Ronggang Wang (Shenzhen Graduate School, Peking University; Immersive Video Coding and Processing)
Yuan Liu (HKUST)