Put Teacher in Student's Shoes: Cross-Distillation for Ultra-compact Model Compression Framework

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing memory constraints, privacy sensitivity, and real-time multi-task inference requirements in edge-deployed NLP, this paper proposes EI-BERT, an ultra-lightweight model compression framework for BERT. Methodologically, it introduces (1) hard token pruning to dynamically eliminate redundant input tokens; (2) cross-distillation, enabling bidirectional knowledge transfer and parameter fusion between teacher and student models; and (3) synergistic quantization combined with multi-task knowledge distillation. The resulting model is a 1.91 MB BERT variant, the smallest general-purpose NLU model to date, retaining over 92% of the original BERT's performance on the GLUE benchmark. Deployed in Alipay's recommendation system, EI-BERT serves 8.4 million edge devices daily, demonstrating both practical utility and robustness under extreme compression.

📝 Abstract
In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model's perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB - the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay's live Edge Recommendation system since January 2024, currently serving the app's recommendation traffic across 8.4 million daily active devices.
Problem

Research questions and friction points this paper is trying to address.

Deploy efficient NLP models in resource-restricted edge settings
Maintain strong performance with ultra-compact memory constraints
Achieve efficient knowledge transfer via novel cross-distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-distillation for efficient knowledge transfer
Hard token pruning and parameter quantization
Ultra-compact BERT model at 1.91 MB
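
The bidirectional knowledge transfer described above can be sketched as a distillation objective with both a forward (teacher to student) and a reverse (student to teacher) term. This is a minimal illustrative sketch, not the paper's actual formulation: the temperature, the weighting `alpha`, and the use of symmetric KL terms are all assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def cross_distill_loss(teacher_logits, student_logits,
                       temperature=2.0, alpha=0.5):
    """Hypothetical bidirectional distillation loss:
    a standard teacher->student KL term plus a reverse
    student->teacher term, standing in for the idea of putting
    the teacher in the student's shoes. All hyperparameters
    here are illustrative assumptions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    forward = kl_divergence(p_teacher, p_student)  # student mimics teacher
    reverse = kl_divergence(p_student, p_teacher)  # teacher sees student view
    return alpha * forward + (1 - alpha) * reverse
```

With identical teacher and student logits the loss is zero; any divergence between the two distributions makes both terms positive, so gradients would push the models toward each other rather than only pulling the student toward a fixed teacher.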
Maolin Wang
City University of Hong Kong and AntGroup
Jun Chu
AntGroup
Sicong Xie
AntGroup
Xiaoling Zang
AntGroup
Yao Zhao
AntGroup
Wenliang Zhong
University of Science and Technology Beijing
Xiangyu Zhao
City University of Hong Kong