Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead, latency, and deployment challenges of large language models (LLMs) in industrial settings, this paper proposes a training and deployment framework for efficient small language models (SLMs). Methodologically, it introduces the first systematic integration of knowledge distillation, weight quantization, and structured pruning—optimized for multi-task learning on professional social platforms—to achieve quality-efficiency Pareto optimality. Additionally, it designs a hardware-adaptive strategy featuring dual-mode (inference + prediction) collaboration, enabling heterogeneous CPU/GPU deployment and low-precision inference engine support. Experimental results demonstrate that the SLMs attain ≥92% of LLM performance on search, recommendation, and generation tasks, reduce training cost by 76%, decrease service latency by 5.3×, and improve throughput by 4.1×. The framework has been deployed at scale, supporting over 100 million daily requests in production.
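The knowledge-distillation component of the pipeline can be illustrated with a minimal sketch (an assumption-laden illustration, not the paper's implementation): a student model is trained to match the teacher's temperature-softened output distribution via a KL-divergence term, in the standard Hinton-style formulation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T produces a softer distribution,
    # exposing the teacher's "dark knowledge" about non-target classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across T.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl
```

In practice this soft-target term is typically combined with a standard cross-entropy loss on ground-truth labels; the weighting between the two is a tuning choice not specified here.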

📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendations to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present methods and insights for training small language models (SLMs) that deliver high performance and efficiency in deployment. We focus on two key techniques: (1) knowledge distillation and (2) model compression via quantization and pruning. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training and serving costs as well as latency. We detail the impact of these techniques on a variety of use cases at a large professional social network platform and share deployment lessons, including hardware optimization strategies that enhance speed and throughput for both predictive and reasoning-based applications.
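The weight-quantization idea mentioned in the abstract can be sketched roughly as follows (a generic illustration, not the paper's inference engine): symmetric per-tensor int8 quantization maps each float weight to an integer in [-127, 127] using a single scale factor, cutting memory and bandwidth roughly 4x versus fp32.

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8 quantization: one scale for the whole
    # tensor, chosen so the largest-magnitude weight maps to +/-127.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction; error per weight is at most ~scale/2.
    return [qi * scale for qi in q]
```

Production engines typically quantize per-channel rather than per-tensor and calibrate activations as well, but the scale-and-round core is the same.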
Problem

Research questions and friction points this paper is trying to address.

Optimizing training and deployment of efficient LLMs
Reducing computational requirements for real-world applications
Enhancing performance and efficiency via knowledge distillation and compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation techniques
Model compression methods
Hardware optimization strategies
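The model-compression bullet above can likewise be illustrated with a hypothetical form of structured pruning (not the paper's method): removing whole output neurons, i.e. entire weight rows, with the smallest L2 norms, which shrinks the matrix in a hardware-friendly way rather than scattering zeros.

```python
def prune_rows(weight_matrix, keep_ratio=0.5):
    # Structured pruning sketch: drop whole rows (output neurons) with
    # the smallest L2 norms, keeping keep_ratio of the rows.
    norms = [sum(w * w for w in row) ** 0.5 for row in weight_matrix]
    k = max(1, int(len(weight_matrix) * keep_ratio))
    # Rank rows by norm (descending), keep the top-k, restore row order.
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:k])
    return [weight_matrix[i] for i in keep], keep
```

Because whole rows are removed, the resulting smaller dense matrix runs efficiently on CPUs and GPUs without sparse kernels, which is why structured (rather than unstructured) pruning pairs well with heterogeneous deployment.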