Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Controllable face generation faces a fundamental trade-off between semantic controllability and photorealism, with existing methods failing to effectively decouple semantic control from the generative process. To address this, we propose Face-MoGLE—a Diffusion Transformer-based framework incorporating a global-local mixture-of-experts mechanism. It integrates mask-conditioned spatial decomposition, a dynamic gating network, and time-varying coefficient modeling to achieve joint structural-semantic disentanglement and hierarchical modulation. The framework enables fine-grained attribute editing and supports multimodal conditioning—including text and sketch—while exhibiting strong zero-shot generalization. Extensive experiments demonstrate significant improvements in both control precision and image fidelity across single- and multi-modal face generation tasks. Face-MoGLE establishes a novel, interpretable, and highly flexible paradigm for controllable generation, advancing the state of the art in semantic-aware diffusion modeling.

📝 Abstract
Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.
Problem

Research questions and friction points this paper is trying to address.

Achieving precise semantic controllability in face generation
Balancing photorealism with attribute manipulation capabilities
Disentangling semantic controls from diffusion-based generation pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-conditioned latent space factorization
Mixture of global and local experts
Dynamic time-dependent gating network
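The three mechanisms above can be sketched as a single routing layer: experts process every spatial token, and a gating network produces mixing coefficients that depend on both the token and the diffusion timestep. The sketch below is a minimal NumPy illustration of that idea; the expert count, tensor shapes, and the specific gating form are assumptions for exposition, not the paper's released DiT implementation.

```python
import numpy as np

# Hypothetical sketch of a global/local mixture-of-experts layer with
# time-dependent gating. Shapes and the gating form are illustrative
# assumptions, not Face-MoGLE's actual architecture.

rng = np.random.default_rng(0)

D = 8   # feature dimension per spatial token
T = 4   # number of spatial tokens (e.g. latent patches)
K = 3   # one global expert + two local (mask-region) experts

# Each expert is a plain linear map here; in the paper they would be
# specialized branches (whole-face structure vs. mask-region semantics).
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(K)]
W_gate = rng.standard_normal((D, K)) / np.sqrt(D)  # would be learned
b_time = rng.standard_normal(K)                    # per-expert time bias

def gate(x, t):
    """Time- and location-dependent gating: a softmax over experts whose
    logits depend on the token features and the diffusion timestep t."""
    logits = x @ W_gate + t * b_time   # timestep tilts the expert logits
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)   # (T, K), rows sum to 1

def moe_layer(x, t):
    """Blend expert outputs with per-token, per-timestep coefficients."""
    coeffs = gate(x, t)                                  # (T, K)
    outs = np.stack([x @ E for E in experts], axis=-1)   # (T, D, K)
    return (outs * coeffs[:, None, :]).sum(axis=-1)      # (T, D)

x = rng.standard_normal((T, D))
y_early = moe_layer(x, t=1.0)   # early diffusion step
y_late = moe_layer(x, t=0.1)    # late step: different expert weighting
print(y_early.shape)            # (4, 8)
```

Because the timestep enters the gating logits, the same layer can lean on the global expert early in denoising (coarse structure) and shift weight toward local experts later (region-level detail), which is the hierarchical modulation the summary describes.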
👥 Authors
Xuechao Zou (Beijing Jiaotong University)
Shun Zhang (Beijing Jiaotong University)
Xing Fu (Ant Group)
Yue Li (Qinghai University)
Kai Li (Tsinghua University)
Yushe Cao (Tsinghua University)
Congyan Lang (Beijing Jiaotong University)
Pin Tao (Tsinghua University)
Junliang Xing (Tsinghua University)