🤖 AI Summary
This work proposes a training-free, single forward-pass method for face image quality assessment that overcomes the limitations of existing approaches, which often rely on multiple forward or backward passes and utilize only the final-layer features of Vision Transformers (ViTs). The proposed method uniquely evaluates quality by analyzing the evolutionary stability of patch embeddings across intermediate ViT layers: high-quality images exhibit smooth feature transitions between layers, whereas low-quality ones show irregular fluctuations. By computing and aggregating Euclidean distances between L2-normalized embeddings of consecutive transformer blocks, the method generates an image-level quality score using any off-the-shelf pretrained ViT without fine-tuning. It achieves performance competitive with state-of-the-art methods across eight benchmarks, including LFW and IJB-C, offering both high efficiency and plug-and-play versatility.
📝 Abstract
Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
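The core computation described above (L2-normalize per-block patch embeddings, take Euclidean distances between consecutive blocks, aggregate into an image-level score) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the exact aggregation (mean over patches and blocks) and the sign convention (negating the mean distance so that higher means better quality) are assumptions, and in practice the per-block embeddings would be captured from a pretrained ViT in a single forward pass, e.g. via forward hooks.

```python
import numpy as np

def vitnt_fiqa_score(block_embeddings):
    """Training-free quality score from per-block ViT patch embeddings.

    block_embeddings: list of arrays, one per transformer block, each of
    shape (num_patches, dim), collected in a single forward pass.
    Returns a scalar score; higher means a smoother (more stable)
    feature-evolution trajectory, i.e. higher predicted quality.
    Aggregation and sign convention are illustrative assumptions.
    """
    dists = []
    for prev, curr in zip(block_embeddings[:-1], block_embeddings[1:]):
        # L2-normalize each patch embedding before comparison
        p = prev / np.linalg.norm(prev, axis=-1, keepdims=True)
        c = curr / np.linalg.norm(curr, axis=-1, keepdims=True)
        # Euclidean distance per patch, averaged over all patches
        dists.append(np.linalg.norm(p - c, axis=-1).mean())
    # Negate: large inter-block jumps (erratic evolution) -> low score
    return -float(np.mean(dists))
```

A quick sanity check of the underlying intuition: a trajectory where each block's embeddings are a small perturbation of the previous block's scores higher than a sequence of unrelated random embeddings.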