🤖 AI Summary
Existing multimodal face generation methods suffer from insufficient cross-modal interaction between semantic masks and textual descriptions, as well as inefficient feature fusion. To address these issues, this paper proposes a diffusion-based Transformer framework for collaborative generation. Its key contributions are: (1) a decoupled attention mechanism that separates dynamic text modeling from static mask modeling, enabling mask-feature caching and reuse and reducing mask-related computational overhead by over 94%; and (2) a unified tokenization strategy coupled with a multivariate Transformer module, enabling synchronous conditional modeling of both mask and text inputs. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches in generation fidelity, cross-modal alignment accuracy, and conditional consistency. This work establishes a new paradigm for efficient, high-quality, controllable multimodal image generation.
📝 Abstract
While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace, a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism that dissociates the implicit dependency between mask tokens and temporal embeddings. This mechanism segregates internal computations into dynamic and static pathways, enabling features computed in the static pathway to be cached and reused after the initial calculation, thereby reducing the additional computational overhead introduced by the mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms competing methods in terms of both facial fidelity and conditional consistency.
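The caching idea behind the decoupled attention can be illustrated with a minimal sketch: because mask tokens do not depend on the diffusion timestep, their key/value projections (the static pathway) can be computed once and reused across all denoising steps, while the image/text pathway (dynamic) is recomputed per step. All class, method, and tensor names below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DecoupledMaskAttention(nn.Module):
    """Hypothetical sketch of attention with a cacheable static (mask) pathway."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_dynamic = nn.Linear(dim, 2 * dim)  # image/text tokens, per timestep
        self.kv_static = nn.Linear(dim, 2 * dim)   # mask tokens, timestep-independent
        self.out = nn.Linear(dim, dim)
        self._mask_cache = None  # filled on the first denoising step

    def forward(self, x, mask_tokens, reuse_cache: bool = True):
        b, n, d = x.shape
        h = self.heads
        q = self.q_proj(x).view(b, n, h, d // h).transpose(1, 2)

        # Dynamic pathway: depends on the noisy latent, recomputed every step.
        k_dyn, v_dyn = self.kv_dynamic(x).chunk(2, dim=-1)

        # Static pathway: computed once, then served from the cache.
        if self._mask_cache is None or not reuse_cache:
            self._mask_cache = self.kv_static(mask_tokens).chunk(2, dim=-1)
        k_st, v_st = self._mask_cache

        # Attend jointly over dynamic and (cached) static keys/values.
        k = torch.cat([k_dyn, k_st], dim=1).view(b, -1, h, d // h).transpose(1, 2)
        v = torch.cat([v_dyn, v_st], dim=1).view(b, -1, h, d // h).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (d // h) ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.out(y)
```

Over a typical sampling run of tens of denoising steps, the mask projection is then paid for only once, which is the source of the large reduction in mask-related overhead.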