Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

๐Ÿ“… 2024-08-22
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the high latency caused by model depth and the scheduling overhead of dynamic workloads in online inference for BERT-like models, this paper proposes "student parallelism": a paradigm that compresses a deep teacher model into a group of shallow, independently executable student models via stacking distillation and a boosting ensemble, so that the number of active students can be temporarily reduced to absorb traffic spikes. The authors further design GPU-customized system support for fine-grained request dispatching and cross-model resource coordination. The approach reduces latency by 1.6x to 4.1x without compromising accuracy and improves throughput by up to 22.27x under bursty loads. The work integrates model compression, dynamic parallelization, and systems-level scheduling into an end-to-end solution for online serving of BERT-like models, delivering low latency, elasticity, and full accuracy at once.
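As a rough illustration of the core idea (not the paper's actual implementation), the sketch below stands in shallow feed-forward "students" for the distilled two-layer models and combines their logits additively, in the spirit of a boosting ensemble. The `active` parameter mimics the paper's ability to run fewer students during a traffic burst; all names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_student(in_dim, n_classes, seed):
    """Hypothetical stand-in for one shallow (e.g., two-layer) student model.

    Each student independently maps an input vector to class logits,
    so the students can run in parallel on a GPU.
    """
    r = np.random.default_rng(seed)
    w1 = r.normal(size=(in_dim, 32)) / np.sqrt(in_dim)
    w2 = r.normal(size=(32, n_classes)) / np.sqrt(32)
    return lambda x: np.maximum(x @ w1, 0.0) @ w2  # ReLU MLP

students = [make_student(16, 4, seed=s) for s in range(8)]

def ensemble_logits(x, students, active=None):
    """Boosting-style ensemble: sum the logits of the first `active` students.

    Lowering `active` trades a little accuracy for throughput,
    mirroring the paper's temporary reduction of the student count.
    """
    k = len(students) if active is None else active
    return sum(s(x) for s in students[:k])

x = rng.normal(size=(16,))
full_logits = ensemble_logits(x, students)             # all 8 students
burst_logits = ensemble_logits(x, students, active=4)  # degraded burst mode
```

Because each student is shallow and self-contained, the sequential depth per request stays constant no matter how many students are added; only the parallel width grows.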

๐Ÿ“ Abstract
Due to high accuracy, BERT-like models have been widely adopted by discriminative text mining and web searching. However, large BERT-like models suffer from inefficient online inference, as they face the following two problems on GPUs. First, they rely on large model depth to achieve high accuracy, which linearly increases the sequential computation on GPUs. Second, stochastic and dynamic online workloads cause extra costs. In this paper, we present Academus for low-latency online inference of BERT-like models. At the core of Academus is the novel student parallelism, which adopts boosting ensemble and stacking distillation to distill the original deep model into an equivalent group of parallel and shallow student models. This enables Academus to achieve lower model depth (e.g., two layers) than baselines and consequently the lowest inference latency without affecting the accuracy. For occasional workload bursts, it can temporarily decrease the number of students with minimal accuracy loss to improve throughput. Additionally, it employs specialized system designs for student parallelism to better handle stochastic online workloads. We conduct comprehensive experiments to verify the effectiveness. The results show that Academus outperforms the baselines by 1.6X~4.1X in latency without compromising accuracy, and achieves up to 22.27X higher throughput for workload bursts.
Problem

Research questions and friction points this paper is trying to address.

High inference latency of deep BERT-like models on GPUs, since depth linearly increases sequential computation.
Extra scheduling costs under stochastic, dynamic online workloads.
Throughput shortfalls during occasional workload bursts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel shallow student models replace sequential model depth
Student count adapts dynamically to workload bursts
Stacking distillation preserves the teacher's accuracy
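The second bullet, adapting the student count to the workload, can be sketched as a simple scheduling policy. The paper's actual scheduler is GPU-customized and more sophisticated; the function below is only a hypothetical illustration of the idea of shrinking the ensemble as the request queue grows, with all thresholds invented for the example.

```python
def choose_active_students(queue_len, max_students=8, min_students=2,
                           per_student_budget=16):
    """Hypothetical policy: run fewer students as the request queue grows.

    Fewer active students means less work per request, so a backlog drains
    faster at a small accuracy cost; this mimics the paper's temporary
    reduction of the student count during workload bursts.
    """
    # Drop one student for every `per_student_budget` queued requests,
    # but never go below the minimum ensemble size.
    reduction = queue_len // per_student_budget
    return max(min_students, max_students - reduction)
```

Under light load the full ensemble runs for maximum accuracy; under a heavy burst the policy bottoms out at the minimum student count, prioritizing throughput until the queue drains.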
๐Ÿ”Ž Similar Papers
No similar papers found.