Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitive computational cost of large language models (LLMs) and their inability to meet the stringent low-latency, high-throughput requirements of LinkedIn's semantic job search, this paper presents a co-optimization framework for a purely text-based, decoder-only small language model (SLM). The method combines structured pruning, which reduces model parameters by up to 40%, with semantic-aware context compression, which shortens input sequences by up to 10x, and pairs both with GPU kernel optimization and a lightweight serving architecture. Evaluated in LinkedIn's production environment, the system achieves a 10x throughput improvement while serving millions of queries per second, keeps P99 latency below 50 ms, and preserves retrieval quality (Recall@10 degradation under 0.3%). This work establishes an efficient, scalable deployment paradigm for SLMs in large-scale semantic search.
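The paper does not publish its pruning criterion, but the structured-pruning idea it describes can be sketched as removing whole FFN neurons with the smallest weight norms, so both projection matrices shrink together. The function below is a hypothetical illustration (the names `prune_ffn_neurons`, `keep_ratio`, and the L2-norm score are assumptions, not the authors' method); a `keep_ratio` of 0.6 mirrors the reported 40% parameter reduction.

```python
import numpy as np

def prune_ffn_neurons(w_in, w_out, keep_ratio=0.6):
    """Structured-pruning sketch: drop whole hidden units (neurons) of an
    FFN block by smallest combined L2 norm, shrinking both matrices.
    w_in: (hidden, d_model) up-projection; w_out: (d_model, hidden)
    down-projection. The real criterion in the paper may differ."""
    scores = np.linalg.norm(w_in, axis=1) + np.linalg.norm(w_out, axis=0)
    k = int(round(keep_ratio * w_in.shape[0]))
    keep = np.sort(np.argsort(scores)[-k:])  # retain the k highest-scoring neurons
    return w_in[keep, :], w_out[:, keep]

rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=(100, 64)), rng.normal(size=(64, 100))
p_in, p_out = prune_ffn_neurons(w_in, w_out, keep_ratio=0.6)
print(p_in.shape, p_out.shape)  # (60, 64) (64, 60)
```

Pruning whole neurons (rather than individual weights) keeps the matrices dense, so the smaller model runs on standard GPU kernels with no sparse-kernel support needed.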

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deployment of such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by up to 10x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, this allows us to increase our system's throughput by 10x in a real-world deployment, while meeting our quality bar.
Problem

Research questions and friction points this paper is trying to address.

Optimizing small language models for efficient semantic job search deployment
Applying model compression to reduce size while maintaining accuracy
Scaling serving infrastructure for high-throughput industry applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed decoder-only small language model for search
Applied pruning to reduce model size by 40%
Compressed input context length by 10x
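The context-compression bullet above can be illustrated with a toy sketch: keep only the passages most relevant to the query, in their original order, within a fixed budget. Everything here is hypothetical (the function name, the sentence granularity, and the lexical-overlap scorer stand in for the paper's unspecified semantic scorer).

```python
def compress_context(query, sentences, budget=3):
    """Context-compression sketch: score each sentence by token overlap
    with the query, keep the top `budget` sentences, and preserve their
    original order. A learned semantic scorer would replace the overlap
    heuristic in a production system."""
    q = set(query.lower().split())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -len(q & set(sentences[i].lower().split())))
    keep = sorted(ranked[:budget])  # restore document order
    return " ".join(sentences[i] for i in keep)

sents = [
    "Senior ML engineer role focused on ranking models.",
    "Our office has a gym and free snacks.",
    "Experience with GPU serving and low latency systems required.",
    "We were founded in 1999.",
    "Python and distributed training experience preferred.",
]
short = compress_context("ML engineer GPU serving Python", sents, budget=3)
print(short)
```

Dropping query-irrelevant text before the model sees it is what makes a 10x context reduction possible without a matching drop in retrieval quality: the decoder's attention cost scales with input length, so shorter contexts translate directly into higher throughput.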