🤖 AI Summary
Existing facial analysis approaches suffer from poor scalability due to reliance on task-specific architectures and hand-crafted preprocessing pipelines. To address this, we propose FaceXFormer, the first end-to-end unified Transformer model for simultaneous face parsing, landmark detection, head pose estimation, attribute prediction, and estimation of age, gender, race, expression, and face visibility — nine tasks in total. Our key contributions are: (1) learnable task tokens enabling adaptive task representation; (2) a parameter-efficient FaceX decoder facilitating cross-task sharing and generalization; and (3) multi-dataset joint training with in-the-wild evaluation. FaceXFormer achieves state-of-the-art or competitive performance across multiple benchmarks, demonstrates strong cross-domain generalization, and attains real-time inference at 33.21 FPS — marking the first unified framework to deliver high-accuracy, real-time, multi-task facial understanding.
📝 Abstract
In this work, we introduce FaceXFormer, an end-to-end unified transformer model capable of performing nine facial analysis tasks including face parsing, landmark detection, head pose estimation, attribute prediction, and estimation of age, gender, race, expression, and face visibility within a single framework. Conventional methods in face analysis have often relied on task-specific designs and pre-processing techniques, which limit their scalability and integration into a unified architecture. Unlike these conventional methods, FaceXFormer leverages a transformer-based encoder-decoder architecture where each task is treated as a learnable token, enabling the seamless integration and simultaneous processing of multiple tasks within a single framework. Moreover, we propose a novel parameter-efficient decoder, FaceX, which jointly processes face and task tokens, thereby learning generalized and robust face representations across different tasks. We jointly trained FaceXFormer on nine face perception datasets and conducted experiments against specialized and multi-task models in both intra-dataset and cross-dataset evaluations across multiple benchmarks, showcasing state-of-the-art or competitive performance. Further, we performed a comprehensive analysis of different backbones for unified face task processing and evaluated our model "in-the-wild", demonstrating its robustness and generalizability. To the best of our knowledge, this is the first work to propose a single model capable of handling nine facial analysis tasks while maintaining real-time performance at 33.21 FPS.
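The core idea of treating each task as a learnable token can be sketched in a few lines: a small set of per-task query vectors cross-attends to the encoder's face features, producing one refined representation per task that downstream heads can consume. The sketch below is illustrative only — the dimensions, single-head attention, and random initialization are our assumptions, not the actual FaceX decoder, which stacks multiple such layers with learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64                 # embedding dimension (assumed for illustration)
num_tasks = 9          # one learnable token per facial analysis task
num_face_tokens = 196  # e.g. 14x14 patch features from the image encoder

# Learnable task tokens (randomly initialized here; trained in practice)
task_tokens = rng.standard_normal((num_tasks, d))
face_tokens = rng.standard_normal((num_face_tokens, d))

# One cross-attention step: task tokens query the face features
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

Q = task_tokens @ Wq
K = face_tokens @ Wk
V = face_tokens @ Wv

attn = softmax(Q @ K.T / np.sqrt(d))  # (num_tasks, num_face_tokens)
task_out = attn @ V                   # refined per-task representations

print(task_out.shape)  # (9, 64)
```

Each row of `task_out` would then feed a task-specific head (e.g. a segmentation head for parsing, a regressor for head pose), which is what lets a single shared decoder serve all nine tasks at once.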