Proto-Former: Unified Facial Landmark Detection by Prototype Transformer

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing facial landmark detection methods generalize poorly and are hard to unify, because datasets define landmarks inconsistently and most methods are trained on a single dataset. To address this, we propose Proto-Former, a framework that enables end-to-end joint training across multiple datasets. Proto-Former introduces an Adaptive Prototype-Aware Encoder (APAE) and a Progressive Prototype-Aware Decoder (PPAD), which incorporate a prototype-aware mechanism and learnable prototype experts. Furthermore, we design a Prototype-Aware (PA) loss that constrains the selection weights of the prototype experts to mitigate gradient conflicts and unstable expert assignment. Experiments on benchmark datasets, including AFLW, 300W, and COFW, show that Proto-Former outperforms state-of-the-art methods in both cross-dataset generalization and landmark detection accuracy.

📝 Abstract
Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can only be trained on a single dataset. This limits model generalization across datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototypes). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model's attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function resolves the instability of prototype expert addressing during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: https://github.com/Husk021118/Proto-Former.
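
The abstract does not spell out how the prototype experts are selected or what form the PA loss takes, so the following is only a minimal sketch of the general idea of soft selection weights over learnable prototype experts. The function names (`softmax`, `mix_prototypes`, `selection_entropy`) are hypothetical, and the entropy term is an assumed stand-in for "constraining the selection weights", not the paper's PA loss.

```python
import math


def softmax(scores):
    """Turn raw gate scores into selection weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prototypes(gate_scores, expert_prototypes):
    """Blend expert prototype vectors using softmax selection weights.

    gate_scores: one raw score per expert (hypothetical gate output).
    expert_prototypes: one feature vector per expert, all the same length.
    """
    weights = softmax(gate_scores)
    dim = len(expert_prototypes[0])
    mixed = [
        sum(w * proto[d] for w, proto in zip(weights, expert_prototypes))
        for d in range(dim)
    ]
    return weights, mixed


def selection_entropy(weights):
    """Entropy of the selection weights.

    Assumption: penalising this value would push the gate toward a
    confident, stable choice of experts. It is an illustrative
    constraint only, not the PA loss from the paper.
    """
    return -sum(w * math.log(w) for w in weights if w > 0)
```

A gate that spreads its weight evenly over the experts has maximal entropy, while a gate that commits to one expert has low entropy. A penalty of this kind is one plausible way to discourage unstable expert selection during joint training.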
Problem

Research questions and friction points this paper is trying to address.

Unifying facial landmark detection across datasets with varying landmark definitions
Overcoming single-dataset training limitations through joint multi-dataset learning
Resolving prototype expert instability and gradient conflicts in training
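
To make the first two points concrete: a batch drawn from several datasets contains annotations of different lengths, so any joint training scheme has to handle that mismatch somehow. The sketch below shows one generic approach, padding to a common length and masking the loss. It is not necessarily how Proto-Former handles it, and `pad_landmarks` and `masked_l1` are hypothetical helper names.

```python
def pad_landmarks(landmarks, max_points):
    """Pad a list of (x, y) points to max_points; return (padded, mask).

    The mask is 1.0 for real landmarks and 0.0 for padding slots.
    """
    n = len(landmarks)
    if n > max_points:
        raise ValueError("sample has more landmarks than max_points")
    padded = list(landmarks) + [(0.0, 0.0)] * (max_points - n)
    mask = [1.0] * n + [0.0] * (max_points - n)
    return padded, mask


def masked_l1(pred, target, mask):
    """Mean L1 error over real landmarks only; padding slots are ignored."""
    total = 0.0
    for (px, py), (tx, ty), m in zip(pred, target, mask):
        total += m * (abs(px - tx) + abs(py - ty))
    valid = sum(mask)
    return total / valid if valid else 0.0
```

With this scheme, a sample with fewer landmarks contributes to the loss only through its annotated points, so predictions in the padded slots neither help nor hurt it.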
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified facial landmark detection using prototype transformer
Adaptive prototype-aware encoder for feature extraction
Progressive decoder with prototype-aware loss optimization
Shengkai Hu
School of Information Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China, and School of Computer Science and Engineering (SCSE), Nanyang Technological University, Singapore 639798
Haozhe Qi
EPFL
MLLM, 3D pose estimation, motion generation, video understanding
Jun Wan
School of Information Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China, and School of Computer Science and Engineering (SCSE), Nanyang Technological University, Singapore 639798
Jiaxing Huang
School of Computer Science and Engineering (SCSE), Nanyang Technological University, Singapore 639798
Lefei Zhang
School of Computer Science, Wuhan University
Pattern Recognition, Machine Learning, Image Processing, Remote Sensing
Hang Sun
Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
Dacheng Tao
Nanyang Technological University
artificial intelligence, machine learning, computer vision, image processing, data mining