DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

📅 2024-11-05
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the efficiency bottleneck in multi-LLM collaborative inference caused by redundant context computation, this paper proposes a KV cache sharing framework for fine-tuned LLMs derived from the same base model. The method introduces two key ideas: (1) a layer-wise analysis that identifies the critical layers at which cached keys and values diverge between models, and (2) a selective recomputation strategy that recomputes only those layers while reusing the rest of the cache, minimizing redundant computation while preserving accuracy. Importantly, the approach requires no architectural modifications or retraining, and it supports multi-LLM serving. Experiments across multiple datasets and model pairs demonstrate up to 3.0× higher throughput, 2.6× faster prefill, and negligible accuracy degradation (<0.1%).
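The selective-reuse idea in the summary can be sketched as a simple per-layer decision rule. This is an illustrative toy, not the paper's implementation: the divergence scores, threshold, and function name are assumptions for exposition.

```python
# Hypothetical sketch of DroidSpeak-style selective KV cache reuse.
# Scores, threshold, and layer count are illustrative assumptions,
# not values from the paper.

def plan_kv_reuse(divergence, threshold):
    """Given a per-layer divergence score between the sender's and
    receiver's KV caches, mark high-divergence layers for recomputation
    and reuse the rest of the shared cache."""
    return ["recompute" if d > threshold else "reuse" for d in divergence]

# Example: 6 layers; suppose early layers diverge more after fine-tuning.
scores = [0.9, 0.7, 0.2, 0.1, 0.05, 0.03]
plan = plan_kv_reuse(scores, threshold=0.5)
# plan -> ['recompute', 'recompute', 'reuse', 'reuse', 'reuse', 'reuse']
```

The point of the sketch: only the layers whose caches actually diverge pay the recomputation cost, which is what lets DroidSpeak trade a small accuracy delta for large prefill savings.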

📝 Abstract
Large Language Models (LLMs) are increasingly employed in complex workflows, where different LLMs and fine-tuned variants collaboratively address complex tasks. However, these systems face significant inefficiencies due to redundant processing of shared context. We propose DroidSpeak, a framework that optimizes context sharing between fine-tuned LLMs derived from the same foundational model. DroidSpeak identifies critical layers in the KV cache and selectively recomputes them, enabling effective reuse of intermediate data while maintaining high accuracy. Our approach balances computational efficiency and task fidelity, significantly reducing inference latency and throughput bottlenecks. Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 3x higher throughput and 2.6x faster prefill times with negligible accuracy loss compared to full recomputation.
Problem

Research questions and friction points this paper is trying to address.

Enable KV cache reuse across different LLMs with the same architecture
Study impact of sharing KV caches on model quality
Improve inference performance via selective recomputation and pipelining
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache sharing across different LLMs
Selective recomputation of KV cache layers
Pipelining re-computation and KV cache loading
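The pipelining contribution above can be illustrated with a small latency model: overlapping the loading of layer i+1's cache with the recomputation of layer i hides transfer time behind compute. The timings, function names, and the assumption that loads are serialized on one link are all illustrative, not taken from the paper.

```python
# Illustrative latency model (not the paper's code) for pipelining
# per-layer KV cache loading with selective recomputation.

def sequential_time(load, compute):
    """No overlap: load every layer's cache, then recompute."""
    return sum(load) + sum(compute)

def pipelined_time(load, compute):
    """Overlap loading layer i+1 with recomputing layer i.
    Each layer's recompute starts once its own cache has loaded
    and the previous layer's recompute has finished."""
    t_load_done = 0.0
    t_compute_done = 0.0
    for ld, cp in zip(load, compute):
        t_load_done += ld  # assume loads are serialized on one link
        start = max(t_load_done, t_compute_done)
        t_compute_done = start + cp
    return t_compute_done

load = [2.0, 2.0, 2.0, 2.0]     # per-layer cache transfer times (arbitrary units)
compute = [3.0, 3.0, 3.0, 3.0]  # per-layer recompute times
# sequential: 8 + 12 = 20; pipelined: 14, since later loads hide behind compute
```

Under these toy numbers, all but the first layer's transfer is hidden, which is the effect the paper exploits to keep the prefill path close to compute-bound.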