GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing facial knowledge learning models rely heavily on costly, manually annotated data, which limits their generalization. To address this, we propose GPTFace, a large-scale generative multimodal pre-training framework tailored for facial knowledge learning. GPTFace leverages weakly correlated image-text pairs crawled from the web and conducts self-supervised joint modeling to achieve cross-modal facial-linguistic understanding and controllable generation. Key components include span masking, masked image/language modeling (MILM), and image-text matching (ITM) within a Transformer architecture, together with an image-text matching loss that pulls the generation distribution toward the control signal, enhancing representation robustness and generation controllability. Experiments show performance comparable to state-of-the-art pre-training models on downstream tasks such as facial attribute classification and expression recognition. GPTFace also supports diverse face editing applications, including attribute editing, expression manipulation, mask removal, and photo inpainting.

📝 Abstract
Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribute classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
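The span-masking side of the MILM objective can be sketched in a few lines: a contiguous span of caption tokens is replaced with mask placeholders, and the model is trained to reconstruct the hidden span. The function name, span length, and `[MASK]` token below are illustrative assumptions, not details from the paper.

```python
import random

MASK = "[MASK]"  # placeholder token; the actual vocabulary is an assumption

def span_mask(tokens, span_len=2, rng=None):
    """Corrupt one contiguous span of tokens with [MASK] placeholders.
    Returns the corrupted sequence, the original span (the reconstruction
    targets), and the span's start index."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    start = rng.randrange(0, len(tokens) - span_len + 1)
    targets = tokens[start:start + span_len]
    corrupted = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    return corrupted, targets, start

# Example: mask a span in a weakly correlated web caption
caption = "a smiling young woman with long blond hair".split()
corrupted, targets, start = span_mask(caption)
```

Masking a contiguous span rather than isolated tokens forces the model to predict multi-token facial descriptions (e.g. "blond hair") from surrounding context, which is the usual motivation for span masking over token-level masking.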
Problem

Research questions and friction points this paper is trying to address.

Developing pre-training models for facial knowledge using web data
Addressing limited scalability of manually annotated face datasets
Enabling controllable facial image and text generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative pre-training using weakly correlated web data
Self-supervised learning with masked image and language modeling
Controllable generation via image-text matching loss
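The controllable-generation idea above, reduced to its core, uses an image-text matching score to pull the generated sample's embedding toward the control signal's embedding. The cosine-based loss below is a minimal sketch under that assumption; the paper's actual ITM loss and embedding spaces are not specified here.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def itm_guidance_loss(gen_emb, ctrl_emb):
    """Illustrative guidance loss: decreases as the generated sample's
    embedding aligns with the control signal's (e.g. text) embedding,
    so minimizing it pulls generation toward the control signal."""
    return 1.0 - cosine_sim(gen_emb, ctrl_emb)
```

In a full system this term would be minimized alongside the generation objective, so that samples both look realistic and match the conditioning text.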