🤖 AI Summary
To address the challenge of jointly optimizing accuracy, latency, and computational cost in mobile real-time face verification, this paper proposes FaceLiVT—a lightweight and efficient model. Methodologically, FaceLiVT introduces three key innovations: (1) multi-head linear attention (MHLA), a novel attention mechanism that drastically reduces the quadratic computational complexity of standard Transformers; (2) a structurally reparameterized token mixer that enhances joint local-global feature modeling; and (3) a hybrid CNN–linear vision Transformer architecture that balances high representational capacity with ultra-low inference latency. Evaluated on standard benchmarks including LFW and CFP-FP, FaceLiVT achieves superior accuracy over state-of-the-art lightweight models. It attains 8.6× faster inference than EdgeFace and 21.2× faster than a pure ViT-based model, enabling millisecond-level inference on edge devices.
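The summary does not spell out MHLA's exact formulation, but the complexity claim rests on the general linear-attention idea: replacing softmax attention with a kernel feature map φ lets the matrix product be reassociated as φ(Q)(φ(K)ᵀV), dropping the cost from O(N²·d) to O(N·d²) in sequence length N. A minimal single-head sketch of that trick (the function name and the elu-style feature map are illustrative assumptions, not the paper's actual MHLA):

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized (linear) attention sketch; NOT the paper's exact MHLA.

    Q, K, V: (N, d) arrays for one head.
    """
    # Positive feature map elu(x)+1, a common choice in linear attention.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    # Associativity trick: form (phi(K)^T V) first -- a (d, d) matrix --
    # so the cost is O(N * d^2) instead of the O(N^2 * d) of softmax(QK^T)V.
    KV = Kp.T @ V                        # (d, d)
    Z = Qp @ Kp.sum(axis=0) + eps        # (N,) row-wise normalization
    return (Qp @ KV) / Z[:, None]
```

Because the (d, d) summary matrix is independent of sequence length, latency grows linearly with the number of tokens, which is what makes this family of mechanisms attractive for mobile inference.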
📝 Abstract
This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)–Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.