ViT-FIQA: Assessing Face Image Quality using Vision Transformers

📅 2025-08-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically investigates the potential of Vision Transformers (ViTs) for Face Image Quality Assessment (FIQA), addressing a limitation of prior CNN-based methods, which struggle to jointly model global semantics and image utility. To this end, the authors propose a learnable quality token that participates in global self-attention alongside the patch tokens, and a dual-branch architecture that jointly optimizes FIQA and face recognition. The model pairs a fully connected representation head, trained with a margin-penalty softmax loss, with a quality regression head that predicts the utility score; both are optimized end-to-end. Extensive experiments on multiple mainstream benchmarks, across diverse face recognition backbones, demonstrate top-tier performance, supporting ViT's representational capacity, generalization ability, and scalability for FIQA.

📝 Abstract
Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research: https://cutt.ly/irHlzXUC.
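The dual-head design described in the abstract can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the authors' implementation: the module names, dimensions, token-pooling choice, and the simple regression head are all assumptions, and the margin-penalty softmax loss applied to the representation during training is omitted.

```python
import torch
import torch.nn as nn

class ViTFIQASketch(nn.Module):
    """Illustrative ViT-FIQA-style model: a learnable quality token joins the
    patch tokens in global self-attention; two heads read the encoder output."""

    def __init__(self, image_size=112, patch_size=16, dim=256,
                 depth=4, heads=4, embed_dim=512):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patchify: conv with stride == patch size, then flatten to tokens.
        self.patch_embed = nn.Conv2d(3, dim, patch_size, stride=patch_size)
        # Learnable quality token, concatenated with the patch tokens.
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Head 1: face representation from the patch tokens (during training
        # this embedding would feed a margin-penalty softmax loss).
        self.repr_head = nn.Linear(dim, embed_dim)
        # Head 2: scalar utility score regressed from the quality token.
        self.quality_head = nn.Sequential(nn.Linear(dim, dim // 2),
                                          nn.ReLU(),
                                          nn.Linear(dim // 2, 1))

    def forward(self, x):
        b = x.shape[0]
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        q = self.quality_token.expand(b, -1, -1)                 # (B, 1, dim)
        # Quality token and patch tokens attend to each other globally.
        tokens = self.encoder(torch.cat([q, tokens], dim=1) + self.pos_embed)
        embedding = self.repr_head(tokens[:, 1:].mean(dim=1))    # patch tokens
        quality = self.quality_head(tokens[:, 0]).squeeze(-1)    # quality token
        return embedding, quality

model = ViTFIQASketch()
emb, quality = model(torch.randn(2, 3, 112, 112))
print(emb.shape, quality.shape)  # torch.Size([2, 512]) torch.Size([2])
```

At inference time only the scalar from the quality head is needed to rank or filter face samples; the representation head exists so that quality is learned jointly with a recognition-oriented embedding, as the abstract describes.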
Problem

Research questions and friction points this paper is trying to address.

Assessing face image quality for recognition systems
Exploring Vision Transformers for quality prediction
Predicting utility scores via learnable quality tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer backbone for face quality
Learnable quality token for utility prediction
Dual-head output for recognition and regression