🤖 AI Summary
To address the high computational overhead, latency, and deployment challenges of large language models (LLMs) in industrial settings, this paper proposes a training and deployment framework for efficient small language models (SLMs). Methodologically, it introduces the first systematic integration of knowledge distillation, weight quantization, and structured pruning—optimized for multi-task learning on professional social platforms—to achieve quality-efficiency Pareto optimality. Additionally, it designs a hardware-adaptive strategy featuring dual-mode (inference + prediction) collaboration, enabling heterogeneous CPU/GPU deployment and low-precision inference engine support. Experimental results demonstrate that the SLMs attain ≥92% of LLM performance on search, recommendation, and generation tasks, reduce training cost by 76%, cut serving latency by 5.3×, and raise throughput by 4.1×. The framework has been deployed at scale, serving over 100 million requests per day in production.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendations to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present methods and insights for training small language models (SLMs) that deliver high performance and efficiency in deployment. We focus on two key techniques: (1) knowledge distillation and (2) model compression via quantization and pruning. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training costs, serving costs, and latency. We detail the impact of these techniques on a variety of use cases at a large professional social network platform and share deployment lessons, including hardware optimization strategies that enhance speed and throughput for both predictive and reasoning-based applications.
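The abstract names knowledge distillation as the first of the two techniques but does not spell out its loss. As a point of reference, a minimal sketch of the standard soft-target formulation (Hinton-style distillation, not the paper's exact recipe) is below; the temperature value and logits here are illustrative placeholders.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# The small student model is trained to minimize this soft-target loss,
# typically combined with ordinary cross-entropy on the hard labels.
teacher = [4.0, 1.0, 0.2]   # hypothetical teacher logits for one example
student = [2.5, 1.5, 0.5]   # hypothetical student logits
loss = distillation_loss(teacher, student)
```

The loss is zero when the student exactly matches the teacher's softened distribution and grows as the distributions diverge, which is what lets the student inherit the teacher's "dark knowledge" about relative class similarities.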