🤖 AI Summary
Existing unified multimodal models (UMMs) support only coarse-grained facial attribute understanding; they lack fine-grained modeling (e.g., of micro-expressions, texture, and local structures) as well as joint understanding–generation capabilities.
Method: We propose UniF²ace, the first face-specialized unified multimodal model. It introduces a facially customized unified framework; establishes, for the first time, a theoretical connection between discrete diffusion score matching and masked generative models; and designs a dual-level (token- and sequence-level) Mixture-of-Experts (MoE) architecture to jointly optimize understanding and generation. Trained on a self-constructed dataset of 130K image–text pairs with associated question-answering annotations, UniF²ace uses a dual-diffusion training strategy to jointly optimize both ELBO objectives.
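To make the dual-level MoE idea concrete, here is a minimal sketch in numpy of one way token-level and sequence-level expert routing could be combined. All shapes, router designs, and expert counts are illustrative assumptions, not the paper's actual implementation (which the summary does not detail).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for routing weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TwoLevelMoE:
    """Hypothetical two-level MoE: a token-level router softly mixes
    fine-grained experts per token, then a sequence-level router
    (gating on the mean-pooled sequence) mixes task experts, e.g. one
    biased toward understanding and one toward generation. Purely a
    sketch of the routing pattern, not UniF2ace's architecture."""

    def __init__(self, d, n_token_experts=4, n_seq_experts=2, seed=0):
        rng = np.random.default_rng(seed)
        self.token_experts = [rng.normal(0, 0.02, (d, d))
                              for _ in range(n_token_experts)]
        self.seq_experts = [rng.normal(0, 0.02, (d, d))
                            for _ in range(n_seq_experts)]
        self.token_router = rng.normal(0, 0.02, (d, n_token_experts))
        self.seq_router = rng.normal(0, 0.02, (d, n_seq_experts))

    def __call__(self, x):  # x: (seq_len, d)
        # Token level: each token gets its own soft expert mixture.
        tok_w = softmax(x @ self.token_router)            # (seq_len, n_tok)
        tok_out = sum(tok_w[:, i:i + 1] * (x @ E)
                      for i, E in enumerate(self.token_experts))
        # Sequence level: one gate shared by the whole sequence.
        seq_w = softmax(x.mean(axis=0) @ self.seq_router)  # (n_seq,)
        return sum(w * (tok_out @ E)
                   for w, E in zip(seq_w, self.seq_experts))

moe = TwoLevelMoE(d=16)
out = moe(np.ones((5, 16)))  # output keeps the input shape (5, 16)
```

Soft (dense) routing is used here for simplicity; real MoE layers typically route each token to a sparse top-k subset of experts to keep compute bounded.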
Results: UniF²ace achieves state-of-the-art performance across fine-grained recognition, controllable editing, and high-fidelity generation—outperforming both existing UMMs and dedicated generative models.
📝 Abstract
Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on **coarse** facial attribute understanding, with limited capacity to handle **fine-grained** facial attributes and without addressing generation capabilities. To overcome these limitations, we propose UniF²ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train UniF²ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, UniF²ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on UniF²ace-130K demonstrate that UniF²ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.
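As a rough illustration of "optimizing both evidence lower bounds simultaneously": in masked generative models, one common form of the ELBO term is a cross-entropy taken only over masked token positions, and a joint objective can then weight the understanding and generation losses. The sketch below shows that generic pattern; the weighting coefficients and per-timestep weights are assumptions, and the paper's exact objective is not given in this summary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_gen_loss(logits, targets, mask):
    """Cross-entropy over masked positions only, a common form of the
    masked-generation ELBO term (per-timestep weights omitted here;
    they vary across discrete-diffusion formulations)."""
    probs = softmax(logits)                      # (seq_len, vocab)
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    return (nll * mask).sum() / max(mask.sum(), 1)

def joint_loss(understanding_loss, generation_loss, lam=0.5):
    # Hypothetical joint objective: a weighted sum of the
    # understanding (e.g. answer cross-entropy) and generation
    # (masked ELBO) terms; lam is an illustrative trade-off weight.
    return lam * understanding_loss + (1 - lam) * generation_loss

# Tiny usage example: near-certain logits on masked positions
# give a near-zero generation loss.
logits = np.zeros((3, 4))
targets = np.array([1, 2, 0])
logits[np.arange(3), targets] = 10.0
mask = np.array([1.0, 0.0, 1.0])   # only positions 0 and 2 were masked
gen = masked_gen_loss(logits, targets, mask)
total = joint_loss(0.3, gen)
```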