🤖 AI Summary
Existing 3D human generation methods face a fundamental trade-off: directly adapting 2D diffusion models often sacrifices local geometric detail, while reconstructing geometry from generated images struggles to ensure global view consistency. This paper introduces Joint2Human, an end-to-end framework for high-fidelity, pose- and text-controllable 3D human geometry generation. Its key contributions are: (1) a compact spherical embedding of 3D joints that enables precise, computationally efficient pose control; and (2) a Fourier Occupancy Field (FOF) representation that lets 2D diffusion models generate 3D shapes directly, with a high-frequency enhancer and a multi-view recarving strategy fusing per-view details into a consistent global shape. Joint2Human simultaneously delivers global structural consistency, fine local detail, high resolution, and low computational cost, outperforming state-of-the-art 3D human generation methods.
📝 Abstract
3D human generation is increasingly significant in various applications. However, the direct use of 2D generative methods in 3D generation often results in losing local details, while methods that reconstruct geometry from generated images struggle with global view consistency. In this work, we introduce Joint2Human, a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly, ensuring both global structure and local details. To achieve this, we employ the Fourier occupancy field (FOF) representation, enabling the direct generation of 3D shapes as preliminary results with 2D generative models. With the proposed high-frequency enhancer and the multi-view recarving strategy, our method can seamlessly integrate details from different views into a uniform global shape. To better utilize the 3D human prior and enhance control over the generated geometry, we introduce a compact spherical embedding of 3D joints, which enables effective pose guidance during the generation process. Additionally, our method can generate 3D humans guided by textual inputs. Our experimental results demonstrate that our method ensures global structure, local details, high resolution, and low computational cost simultaneously. More results and the code can be found on our project page: http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human.
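The core idea behind the FOF representation is that the occupancy function along each pixel's depth axis can be approximated by a truncated Fourier series, turning a 3D volume into a compact 2D multi-channel map that a 2D generative model can produce directly. The sketch below is a minimal illustration under that assumption, using a plain cosine/sine basis with a naive rectangle-rule projection; the function names and the number of terms are hypothetical and not the paper's exact formulation:

```python
import numpy as np

def fof_encode(occ, n_terms=8):
    """Project a (H, W, D) occupancy volume onto a truncated Fourier
    basis along depth, yielding a (H, W, 2*n_terms+1) 2D feature map."""
    H, W, D = occ.shape
    z = np.linspace(-1.0, 1.0, D)          # depth normalized to [-1, 1]
    basis = [np.ones_like(z)]              # constant term (a0)
    for k in range(1, n_terms + 1):
        basis.append(np.cos(k * np.pi * z))
        basis.append(np.sin(k * np.pi * z))
    B = np.stack(basis, axis=0)            # (2*n_terms+1, D)
    # Rectangle-rule inner product approximates the Fourier integrals.
    return np.einsum('hwd,cd->hwc', occ.astype(float), B) * (2.0 / D)

def fof_decode(coeffs, D=128, thresh=0.5):
    """Evaluate the truncated Fourier series along depth and threshold
    it back into a binary (H, W, D) occupancy volume."""
    z = np.linspace(-1.0, 1.0, D)
    n_terms = (coeffs.shape[-1] - 1) // 2
    basis = [0.5 * np.ones_like(z)]        # a0 enters the series as a0/2
    for k in range(1, n_terms + 1):
        basis.append(np.cos(k * np.pi * z))
        basis.append(np.sin(k * np.pi * z))
    B = np.stack(basis, axis=0)
    return np.einsum('hwc,cd->hwd', coeffs, B) > thresh
```

Because the encoding is a 2D map, standard image diffusion architectures can operate on it without 3D convolutions; the price is Gibbs-style ringing near occupancy boundaries, which is consistent with the paper's motivation for a dedicated high-frequency enhancer.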