🤖 AI Summary
Addressing the memory constraints, privacy sensitivity, and real-time multi-task inference requirements of edge-deployed NLP, this paper proposes EI-BERT, an ultra-lightweight model compression framework for BERT. Methodologically, it introduces (1) hard token pruning to dynamically eliminate redundant input tokens; (2) cross-distillation, which enables bidirectional knowledge transfer and parameter fusion between teacher and student models; and (3) synergistic quantization combined with multi-task knowledge distillation. The resulting model is a 1.91 MB BERT variant, the smallest general-purpose NLU model to date, retaining over 92% of the original BERT's performance on the GLUE benchmark. Deployed in Alipay's recommendation system, EI-BERT serves 8.4 million edge devices daily, demonstrating both practical utility and robustness under extreme compression.
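To make the distillation component concrete, here is a minimal NumPy sketch of temperature-scaled knowledge distillation, plus a naive symmetric variant suggesting the bidirectional teacher-student transfer that cross-distillation describes. This is an illustrative sketch only; the function names, the symmetric-KL formulation, and the temperature choice are assumptions, not the paper's actual method, which also involves parameter integration between the two models.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Standard soft-label distillation: KL(teacher || student) at temperature T."""
    p = softmax(teacher_logits, T)  # teacher's softened distribution
    q = softmax(student_logits, T)  # student's softened distribution
    return float(T * T * np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def cross_kd_loss(student_logits, teacher_logits, T=2.0):
    """Hypothetical symmetric term: knowledge flows in both directions,
    so the teacher is also pulled toward the student's perspective."""
    return (kd_loss(student_logits, teacher_logits, T)
            + kd_loss(teacher_logits, student_logits, T))
```

The symmetric loss is zero only when both models agree, which captures the "mutual interplay" intuition in a toy form.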
📝 Abstract
In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce the Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation, and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model's perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB, the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay's live Edge Recommendation system since January 2024, currently serving the app's recommendation traffic across 8.4 million daily active devices.
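The parameter quantization step in the pipeline can be illustrated with a minimal sketch of symmetric per-tensor int8 quantization, one common way to shrink model weights toward an ultra-compact footprint. The bit width, symmetric scheme, and per-tensor granularity here are assumptions for illustration; the abstract does not specify which quantization scheme EI-BERT uses.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q.
    Storing q (1 byte/weight) instead of float32 gives ~4x compression."""
    w = np.asarray(w, dtype=np.float32)
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from int8 codes."""
    return q.astype(np.float32) * scale
```

A round trip (`dequantize(*quantize_int8(w))`) introduces at most half a quantization step of error per weight, i.e. `scale / 2`.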