Multimodal Conditional 3D Face Geometry Generation

📅 2024-07-01
🏛️ Computers & Graphics
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-fidelity, controllable 3D facial geometry from diverse input modalities. We propose a diffusion-based method that operates directly in UV parameterization space to synthesize high-resolution (512×512) facial geometry. Our approach unifies six heterogeneous conditioning inputs—sketches, Canny edges, 2D keypoints, FLAME parameters, portrait images, and text prompts—within a single model. To achieve fine-grained, disentangled control over identity and expression, we introduce a novel multi-path IP-Adapter cross-attention mechanism. Additionally, we jointly encode FLAME parameters and text embeddings to enforce geometric plausibility and semantic alignment. Extensive experiments demonstrate state-of-the-art performance in cross-modal consistency, identity/expression fidelity, and interactive flexibility, significantly advancing controllable 3D face generation.

📝 Abstract
We present a new method for multimodal conditional 3D face geometry generation that allows user-friendly control over the output identity and expression via a number of different conditioning signals. Within a single model, we demonstrate 3D faces generated from artistic sketches, 2D face landmarks, Canny edges, FLAME face model parameters, portrait photos, or text prompts. Our approach is based on a diffusion process that generates 3D geometry in a 2D parameterized UV domain. Geometry generation passes each conditioning signal through a set of cross-attention layers (IP-Adapter), one set for each user-defined conditioning signal. The result is an easy-to-use 3D face generation tool that produces high-resolution geometry with fine-grained user control.
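The per-signal cross-attention described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the module name, dimensions, and the residual summation of the attention paths are assumptions, following the common IP-Adapter pattern of giving each conditioning modality its own key/value projection and cross-attention block.

```python
import torch
import torch.nn as nn

class MultiPathCrossAttention(nn.Module):
    """One cross-attention path per conditioning signal; hypothetical sketch,
    not the authors' code. Each path attends from UV-space feature tokens to
    that signal's tokens, and the results are summed residually."""
    def __init__(self, dim, cond_dims, num_heads=4):
        super().__init__()
        # One projection + attention block per modality
        # (e.g. sketch, keypoints, FLAME params, portrait, text).
        self.proj = nn.ModuleDict({
            name: nn.Linear(d, dim) for name, d in cond_dims.items()
        })
        self.attn = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for name in cond_dims
        })

    def forward(self, x, conds):
        # x: UV feature tokens, shape (B, N, dim)
        # conds: dict of condition tokens, shape (B, M_k, cond_dims[k])
        out = x
        for name, tokens in conds.items():
            kv = self.proj[name](tokens)          # project to shared width
            attended, _ = self.attn[name](x, kv, kv)
            out = out + attended                  # residual injection per path
        return out

# Toy usage: inject two conditioning signals into 16 UV tokens of width 32.
layer = MultiPathCrossAttention(dim=32, cond_dims={"sketch": 8, "text": 12})
x = torch.randn(2, 16, 32)
conds = {"sketch": torch.randn(2, 5, 8), "text": torch.randn(2, 7, 12)}
y = layer(x, conds)
```

Because the paths are additive, conditions can be dropped or mixed at inference time simply by omitting entries from `conds`, which matches the single-model, multi-signal flexibility the abstract describes.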
Problem

Research questions and friction points this paper is trying to address.

Generating 3D face geometry from multimodal inputs
Enabling user control over identity and expression
Producing topology-consistent high-quality facial models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal conditional generation with cross-attention layers
Diffusion process in 2D UV domain for 3D geometry
Single model accepts multiple input signals including sketches
Authors

Christopher Otto
ETH Zürich, Switzerland and DisneyResearch|Studios, Switzerland

Prashanth Chandran
DisneyResearch|Studios, Switzerland

Sebastian Weiss
Disney Research Zürich

Markus H. Gross
ETH Zürich, Switzerland and DisneyResearch|Studios, Switzerland

Gaspard Zoss
DisneyResearch|Studios, Switzerland

Derek Bradley
DisneyResearch|Studios

Fields: Computer Graphics, Computer Vision