Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Controllable face generation faces a fundamental trade-off between semantic controllability and photorealism, with existing methods failing to effectively decouple semantic control from the generative process. To address this, we propose Face-MoGLE—a Diffusion Transformer-based framework incorporating a global-local mixture-of-experts mechanism. It integrates mask-conditioned spatial decomposition, a dynamic gating network, and time-varying coefficient modeling to achieve joint structural-semantic disentanglement and hierarchical modulation. The framework enables fine-grained attribute editing and supports multimodal conditioning—including text and sketch—while exhibiting strong zero-shot generalization. Extensive experiments demonstrate significant improvements in both control precision and image fidelity across single- and multi-modal face generation tasks. Face-MoGLE establishes a novel, interpretable, and highly flexible paradigm for controllable generation, advancing the state of the art in semantic-aware diffusion modeling.

📝 Abstract
Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.
Problem

Research questions and friction points this paper is trying to address.

Achieving precise semantic controllability in face generation
Balancing photorealism with attribute manipulation capabilities
Disentangling semantic controls from diffusion-based generation pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-conditioned latent space factorization
Mixture of global and local experts
Dynamic time-dependent gating network
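The three mechanisms above can be sketched as a single routing layer: experts process every spatial token, and a gating network produces mixing coefficients that depend on both the token and the diffusion timestep. The sketch below is a minimal NumPy illustration of that idea; the expert count, tensor shapes, and the specific gating form are assumptions for exposition, not the paper's released DiT implementation.

```python
import numpy as np

# Hypothetical sketch of a global/local mixture-of-experts layer with
# time-dependent gating. Shapes and the gating form are illustrative
# assumptions, not Face-MoGLE's actual architecture.

rng = np.random.default_rng(0)

D = 8   # feature dimension per spatial token
T = 4   # number of spatial tokens (e.g. latent patches)
K = 3   # one global expert + two local (mask-region) experts

# Each expert is a plain linear map here; in the paper they would be
# specialized branches (whole-face structure vs. mask-region semantics).
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(K)]
W_gate = rng.standard_normal((D, K)) / np.sqrt(D)  # would be learned
b_time = rng.standard_normal(K)                    # per-expert time bias

def gate(x, t):
    """Time- and location-dependent gating: a softmax over experts whose
    logits depend on the token features and the diffusion timestep t."""
    logits = x @ W_gate + t * b_time   # timestep tilts the expert logits
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)   # (T, K), rows sum to 1

def moe_layer(x, t):
    """Blend expert outputs with per-token, per-timestep coefficients."""
    coeffs = gate(x, t)                                  # (T, K)
    outs = np.stack([x @ E for E in experts], axis=-1)   # (T, D, K)
    return (outs * coeffs[:, None, :]).sum(axis=-1)      # (T, D)

x = rng.standard_normal((T, D))
y_early = moe_layer(x, t=1.0)   # early diffusion step
y_late = moe_layer(x, t=0.1)    # late step: different expert weighting
print(y_early.shape)            # (4, 8)
```

Because the timestep enters the gating logits, the same layer can lean on the global expert early in denoising (coarse structure) and shift weight toward local experts later (region-level detail), which is the hierarchical modulation the summary describes.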
👥 Authors
Xuechao Zou (Beijing Jiaotong University)
Shun Zhang (Beijing Jiaotong University)
Xing Fu (Ant Group)
Yue Li (Qinghai University)
Kai Li (Tsinghua University)
Yushe Cao (Tsinghua University)
Congyan Lang (Beijing Jiaotong University)
Pin Tao (Tsinghua University)
Junliang Xing (Tsinghua University)