FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device

📅 2025-06-12
🤖 AI Summary
To address the challenge of jointly optimizing accuracy, latency, and computational cost in real-time face verification on mobile devices, this paper proposes FaceLiVT, a lightweight and efficient model. FaceLiVT introduces three key components: (1) multi-head linear attention (MHLA), an attention mechanism that reduces the quadratic computational complexity of standard Transformer self-attention; (2) a structurally reparameterized token mixer that enhances joint local-global feature modeling; and (3) a hybrid CNN-linear vision Transformer architecture that balances representational capacity with low inference latency. Evaluated on standard benchmarks including LFW and CFP-FP, FaceLiVT is reported to outperform state-of-the-art lightweight models. Its inference is 8.6× faster than EdgeFace and 21.2× faster than a pure ViT-based model, enabling millisecond-level inference on edge devices.

📝 Abstract
This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with a lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.
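The abstract does not spell out how MHLA is computed, so the following is only a minimal sketch of generic kernelized linear attention, not the paper's formulation. It illustrates why reordering the matrix products removes the N × N attention matrix: the cost drops from O(N²·d) to O(N·d²) for N tokens of width d. The feature map (elu(x) + 1), the function names, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a positive feature map (an assumed choice, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def quadratic_attention(q, k, v):
    # Reference form: builds the full (N, N) similarity matrix.
    qf, kf = feature_map(q), feature_map(k)
    scores = qf @ kf.T                                # (N, N)
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v                                # (N, d_v)


def linear_attention(q, k, v):
    # Same result, but K^T V is computed first, so no (N, N) matrix is formed.
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                                     # (d, d_v)
    z = kf.sum(axis=0)                                # (d,)
    return (qf @ kv) / (qf @ z)[:, None]              # (N, d_v)


def multi_head_linear_attention(q, k, v, num_heads):
    # Split the feature axis into heads and attend within each head independently.
    heads = [
        linear_attention(qh, kh, vh)
        for qh, kh, vh in zip(
            np.split(q, num_heads, axis=1),
            np.split(k, num_heads, axis=1),
            np.split(v, num_heads, axis=1),
        )
    ]
    return np.concatenate(heads, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(196, 64)) for _ in range(3))
    same = np.allclose(quadratic_attention(q, k, v), linear_attention(q, k, v))
    print("linear form matches quadratic form:", same)
    print("multi-head output shape:", multi_head_linear_attention(q, k, v, 4).shape)
```

Both forms compute the same output; only the order of multiplication differs. That is why the linear form gains more as the token count grows relative to the head width.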
Problem

Research questions and friction points this paper is trying to address.

Develop lightweight face recognition for mobile devices
Reduce computational complexity while maintaining accuracy
Improve inference speed on resource-constrained platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid CNN-Transformer architecture for lightweight design
Multi-Head Linear Attention reduces computational complexity
Structural reparameterization enhances mobile inference speed
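The summary does not describe the reparameterized token mixer in detail, so the sketch below shows structural reparameterization in its generic RepVGG-style form as an assumed stand-in. A training-time block with three branches (3×3 convolution plus batch normalization, 1×1 convolution, identity) is folded into a single 3×3 convolution for inference. The branch layout, helper names, and the naive NumPy convolution are illustrative assumptions, not the paper's design.

```python
import numpy as np


def conv2d(x, w, b):
    # Naive stride-1 "same" convolution. x: (C_in, H, W), w: (C_out, C_in, k, k).
    k = w.shape[2]
    p = k // 2
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) across all channels.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    # Inference-mode batch normalization with fixed running statistics.
    scale = gamma / np.sqrt(var + eps)
    return (y - mean[:, None, None]) * scale[:, None, None] + beta[:, None, None]


def multi_branch_forward(x, p):
    # Training-time structure: BN(conv3x3(x)) + conv1x1(x) + x.
    y3 = batchnorm(conv2d(x, p["w3"], p["b3"]), p["gamma"], p["beta"], p["mean"], p["var"])
    return y3 + conv2d(x, p["w1"], p["b1"]) + x


def reparameterize(p, eps=1e-5):
    # Fold BN into the 3x3 kernel, then add the 1x1 and identity branches
    # at the kernel center, giving one equivalent 3x3 convolution.
    scale = p["gamma"] / np.sqrt(p["var"] + eps)
    w = p["w3"] * scale[:, None, None, None]
    b = (p["b3"] - p["mean"]) * scale + p["beta"]
    w[:, :, 1, 1] += p["w1"][:, :, 0, 0]
    idx = np.arange(w.shape[0])
    w[idx, idx, 1, 1] += 1.0          # identity branch (requires C_in == C_out)
    return w, b + p["b1"]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = 4
    params = {
        "w3": rng.normal(size=(c, c, 3, 3)), "b3": rng.normal(size=c),
        "w1": rng.normal(size=(c, c, 1, 1)), "b1": rng.normal(size=c),
        "gamma": rng.normal(size=c), "beta": rng.normal(size=c),
        "mean": rng.normal(size=c), "var": rng.uniform(0.5, 2.0, size=c),
    }
    x = rng.normal(size=(c, 8, 8))
    w, b = reparameterize(params)
    print("merged conv matches multi-branch block:",
          np.allclose(multi_branch_forward(x, params), conv2d(x, w, b)))
```

The merged kernel produces the same output as the three-branch block while running a single convolution. This is the general reason reparameterized designs can lower latency on mobile hardware.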