EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation

📅 2025-01-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address high latency and low throughput bottlenecks in large language model (LLM) online serving, this paper proposes a semantic-aware request reuse and dynamic routing framework. Methodologically, it introduces: (1) a novel utility-aware real-time cache replay mechanism enabling context caching and fine-tuning-free cross-model knowledge distillation; (2) an adaptive LLM routing framework that jointly optimizes quality, latency, and throughput via semantic similarity retrieval, dynamic in-context example selection, and multi-level load-aware scheduling; and (3) an offline cost-aware cache optimization strategy. Evaluated on over one million real-world requests, the framework achieves 1.4×–5.9× higher throughput and 28%–71% lower latency, while preserving response quality.
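The retrieval-and-prepend step described above can be sketched in a few lines. This is a minimal illustration, not EchoLM's implementation: the toy bag-of-words embedding, the `build_prompt` helper, and the similarity threshold are all hypothetical stand-ins (a real system would use a learned sentence encoder and a vector index).

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" used purely for illustration;
    # a production system would use a learned sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(request, cache, k=2, threshold=0.3):
    """Prepend the k most similar cached (request, response) pairs
    as in-context examples before the new request."""
    q = embed(request)
    scored = sorted(
        ((cosine(q, embed(r)), r, resp) for r, resp in cache),
        reverse=True,
    )
    examples = [(r, resp) for s, r, resp in scored[:k] if s >= threshold]
    parts = [f"Q: {r}\nA: {resp}" for r, resp in examples]
    parts.append(f"Q: {request}\nA:")
    return "\n\n".join(parts)

cache = [
    ("how do I reverse a list in python", "Use lst[::-1] or lst.reverse()."),
    ("what is the capital of France", "Paris."),
]
print(build_prompt("how to reverse a python list", cache))
```

Dissimilar cache entries fall below the threshold and are excluded, so only genuinely related historical requests guide the cheaper model's response.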

📝 Abstract
Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 60% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge sharing among requests. However, naively caching and reusing past responses leads to large quality degradation. In this paper, we introduce EchoLM, an in-context caching system that leverages historical requests as examples to guide response generation, enabling selective offloading of requests to more efficient LLMs. Enabling this real-time knowledge transfer, however, introduces intricate tradeoffs between response quality, latency, and system throughput at scale. For a new request, EchoLM identifies similar, high-utility examples and efficiently prepends them to the input for a better response. At scale, EchoLM adaptively routes requests to LLMs of varying capabilities, accounting for response quality and serving loads. EchoLM employs a cost-aware cache replay mechanism to improve example quality and coverage offline, maximizing cache utility and runtime efficiency. Evaluations on millions of open-source requests demonstrate that EchoLM improves throughput by 1.4-5.9x and reduces latency by 28-71% without hurting response quality on average.
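The adaptive routing idea in the abstract can be illustrated with a toy decision rule. Everything below is an assumed sketch, not the paper's actual policy: the scoring formula, `sim_threshold`, and `load_penalty` are hypothetical parameters chosen only to show how example quality and serving load can jointly drive the small-vs-large model choice.

```python
def route(similarity, small_load, large_load,
          sim_threshold=0.7, load_penalty=0.1):
    """Toy quality/load-aware router: prefer the cheaper model when
    good in-context examples exist, unless it is overloaded.
    The linear scoring rule below is illustrative, not EchoLM's."""
    # The small model's expected quality rises with example similarity
    # and falls as its queue grows; the large model's baseline quality
    # is represented by sim_threshold, also penalized by its load.
    small_score = similarity - load_penalty * small_load
    large_score = sim_threshold - load_penalty * large_load
    return "small" if small_score >= large_score else "large"

print(route(similarity=0.9, small_load=1, large_load=3))  # small
print(route(similarity=0.2, small_load=0, large_load=0))  # large
```

A request with highly similar cached examples is offloaded to the efficient model even under moderate load, while a novel request with no good examples goes to the stronger model regardless of queue lengths.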
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource Consumption
Processing Speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

EchoLM
Real-time Knowledge Sharing
Dynamic Request Allocation