🤖 AI Summary
Existing unified multimodal models (UMMs) support only coarse-grained facial attribute understanding; they lack fine-grained modeling (e.g., of micro-expressions, texture, and local structures) as well as joint understanding–generation capabilities.
Method: We propose UniF²ace, the first face-specialized unified multimodal model. It introduces a facially customized unified framework; establishes, for the first time, a theoretical connection between discrete diffusion score matching and masked generative models; and designs a dual-level (token- and sequence-level) Mixture-of-Experts (MoE) architecture to jointly optimize understanding and generation. Trained on a self-constructed dataset of 130K image–text pairs with associated question-answering annotations, UniF²ace uses a dual-diffusion training strategy to jointly optimize both ELBO objectives.
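To make the dual-level MoE idea concrete, here is a minimal sketch in numpy of one way token-level and sequence-level expert routing could be combined. All shapes, router designs, and expert counts are illustrative assumptions, not the paper's actual implementation (which the summary does not detail).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for routing weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TwoLevelMoE:
    """Hypothetical two-level MoE: a token-level router softly mixes
    fine-grained experts per token, then a sequence-level router
    (gating on the mean-pooled sequence) mixes task experts, e.g. one
    biased toward understanding and one toward generation. Purely a
    sketch of the routing pattern, not UniF2ace's architecture."""

    def __init__(self, d, n_token_experts=4, n_seq_experts=2, seed=0):
        rng = np.random.default_rng(seed)
        self.token_experts = [rng.normal(0, 0.02, (d, d))
                              for _ in range(n_token_experts)]
        self.seq_experts = [rng.normal(0, 0.02, (d, d))
                            for _ in range(n_seq_experts)]
        self.token_router = rng.normal(0, 0.02, (d, n_token_experts))
        self.seq_router = rng.normal(0, 0.02, (d, n_seq_experts))

    def __call__(self, x):  # x: (seq_len, d)
        # Token level: each token gets its own soft expert mixture.
        tok_w = softmax(x @ self.token_router)            # (seq_len, n_tok)
        tok_out = sum(tok_w[:, i:i + 1] * (x @ E)
                      for i, E in enumerate(self.token_experts))
        # Sequence level: one gate shared by the whole sequence.
        seq_w = softmax(x.mean(axis=0) @ self.seq_router)  # (n_seq,)
        return sum(w * (tok_out @ E)
                   for w, E in zip(seq_w, self.seq_experts))

moe = TwoLevelMoE(d=16)
out = moe(np.ones((5, 16)))  # output keeps the input shape (5, 16)
```

Soft (dense) routing is used here for simplicity; real MoE layers typically route each token to a sparse top-k subset of experts to keep compute bounded.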
Results: UniF²ace achieves state-of-the-art performance across fine-grained recognition, controllable editing, and high-fidelity generation—outperforming both existing UMMs and dedicated generative models.
📝 Abstract
Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on **coarse** facial attribute understanding, with limited capacity to handle **fine-grained** facial attributes and without addressing generation capabilities. To overcome these limitations, we propose UniF²ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train UniF²ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, UniF²ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on UniF²ace-130K demonstrate that UniF²ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.
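As a rough illustration of "optimizing both evidence lower bounds simultaneously": in masked generative models, one common form of the ELBO term is a cross-entropy taken only over masked token positions, and a joint objective can then weight the understanding and generation losses. The sketch below shows that generic pattern; the weighting coefficients and per-timestep weights are assumptions, and the paper's exact objective is not given in this summary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_gen_loss(logits, targets, mask):
    """Cross-entropy over masked positions only, a common form of the
    masked-generation ELBO term (per-timestep weights omitted here;
    they vary across discrete-diffusion formulations)."""
    probs = softmax(logits)                      # (seq_len, vocab)
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    return (nll * mask).sum() / max(mask.sum(), 1)

def joint_loss(understanding_loss, generation_loss, lam=0.5):
    # Hypothetical joint objective: a weighted sum of the
    # understanding (e.g. answer cross-entropy) and generation
    # (masked ELBO) terms; lam is an illustrative trade-off weight.
    return lam * understanding_loss + (1 - lam) * generation_loss

# Tiny usage example: near-certain logits on masked positions
# give a near-zero generation loss.
logits = np.zeros((3, 4))
targets = np.array([1, 2, 0])
logits[np.arange(3), targets] = 10.0
mask = np.array([1.0, 0.0, 1.0])   # only positions 0 and 2 were masked
gen = masked_gen_loss(logits, targets, mask)
total = joint_loss(0.3, gen)
```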