🤖 AI Summary
To address the high computational overhead, latency, and deployment challenges of large language models (LLMs) in industrial settings, this paper proposes a training and deployment framework for efficient small language models (SLMs). Methodologically, it introduces the first systematic integration of knowledge distillation, weight quantization, and structured pruning—optimized for multi-task learning on professional social platforms—to achieve quality-efficiency Pareto optimality. Additionally, it designs a hardware-adaptive strategy featuring dual-mode (inference + prediction) collaboration, enabling heterogeneous CPU/GPU deployment and low-precision inference engine support. Experimental results demonstrate that the SLMs attain ≥92% of LLM performance on search, recommendation, and generation tasks, reduce training cost by 76%, cut serving latency by 5.3×, and raise throughput by 4.1×. The framework has been deployed at scale, serving over 100 million requests per day in production.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendations to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present methods and insights for training small language models (SLMs) that deliver high performance and efficiency in deployment. We focus on two key techniques: (1) knowledge distillation and (2) model compression via quantization and pruning. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training costs, serving costs, and latency. We detail the impact of these techniques on a variety of use cases at a large professional social network platform and share deployment lessons, including hardware optimization strategies that enhance speed and throughput for both predictive and reasoning-based applications.
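The abstract names knowledge distillation as the first of the two techniques but does not spell out its loss. As a point of reference, a minimal sketch of the standard soft-target formulation (Hinton-style distillation, not the paper's exact recipe) is below; the temperature value and logits here are illustrative placeholders.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# The small student model is trained to minimize this soft-target loss,
# typically combined with ordinary cross-entropy on the hard labels.
teacher = [4.0, 1.0, 0.2]   # hypothetical teacher logits for one example
student = [2.5, 1.5, 0.5]   # hypothetical student logits
loss = distillation_loss(teacher, student)
```

The loss is zero when the student exactly matches the teacher's softened distribution and grows as the distributions diverge, which is what lets the student inherit the teacher's "dark knowledge" about relative class similarities.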