🤖 AI Summary
Existing diffusion models for semantic face image synthesis suffer from high computational overhead, specifically the quadratic O(N²) complexity of self-attention mechanisms. To address this, we propose SISMA, the first method to integrate the linear-complexity Mamba state space model into this task, replacing Transformer-based attention layers and incorporating semantic mask guidance for controlled generation. SISMA unifies long-sequence modeling with fine-grained structural control. On CelebAMask-HQ, it achieves superior FID scores compared to state-of-the-art methods and runs inference three times faster than the current best-performing model. This work significantly reduces both training and inference costs while empirically validating the effectiveness and scalability of the Mamba architecture in generative vision tasks. Moreover, it establishes a novel paradigm for efficient and controllable face synthesis, advancing the practical deployment of semantic-aware diffusion models.
📝 Abstract
Diffusion Models have become very popular for Semantic Image Synthesis (SIS) of human faces. Nevertheless, their training and inference are computationally expensive, owing to the quadratic complexity of attention layers. In this paper, we propose a novel architecture called SISMA, based on the recently proposed Mamba. SISMA generates high-quality samples, controlling their shape with a semantic mask, at a reduced computational demand. We validated our approach through comprehensive experiments on CelebAMask-HQ, showing that our architecture not only achieves a better FID score but also operates at three times the speed of state-of-the-art architectures. This indicates that the proposed design is a viable, lightweight substitute for transformer-based models.
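The efficiency argument above rests on replacing quadratic self-attention with a linear-time state-space recurrence. The toy sketch below is not the SISMA implementation; it only illustrates, on scalar 1-D sequences, why an attention layer does O(N²) pairwise work while a Mamba-style recurrence does O(N) work (the decay `a` and input gain `b` are made-up illustrative parameters, and real Mamba uses input-dependent, vector-valued states):

```python
import math

def attention_1d(x):
    """Toy single-head self-attention over a scalar sequence.
    Every position scores against every other: N scores per output,
    hence O(N^2) total work."""
    n = len(x)
    out = []
    for i in range(n):
        scores = [x[i] * x[j] for j in range(n)]       # N pairwise scores
        m = max(scores)                                # softmax, numerically stable
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / z)
    return out

def ssm_scan_1d(x, a=0.9, b=0.1):
    """Toy linear state-space recurrence h_t = a*h_{t-1} + b*x_t:
    one constant-cost update per position, O(N) total work."""
    h, out = 0.0, []
    for xt in x:
        h = a * h + b * xt
        out.append(h)
    return out

seq = [float(i) for i in range(8)]
print(len(attention_1d(seq)), len(ssm_scan_1d(seq)))  # one output per input for both
```

Doubling the sequence length roughly doubles the scan's cost but quadruples the attention loop's, which is the gap the abstract's reported 3x inference speedup exploits at image-token resolutions.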