Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the longstanding trade-off among accuracy, query latency, and computational efficiency in semantic caching, this paper proposes a lightweight, domain-oriented embedding optimization framework. Methodologically: (1) we introduce the first synthetic data generation pipeline designed specifically for semantic caching; (2) we fine-tune compact domain-specific embedding models in a single phase, integrating domain adaptation and synthetic data augmentation; and (3) we enable low-overhead semantic similarity retrieval and efficient model deployment. Experiments show the approach consistently outperforming state-of-the-art open-source and commercial embedding models in both precision and recall, with over 40% reduction in query latency, substantial improvements in cache hit rate and system throughput, and seamless integration into LLM-driven semantic caching pipelines. The framework establishes a practical, high-performance paradigm for production-grade semantic caching.
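The core mechanism the summary describes — returning a cached response when a new query's embedding is sufficiently close to that of a previously seen query — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the character-trigram `embed` function is a toy stand-in for the fine-tuned embedding model, and the threshold value is an arbitrary assumption.

```python
import math

def embed(text, dim=256):
    """Toy character-trigram embedding (stand-in for a fine-tuned model),
    L2-normalized so a dot product equals cosine similarity."""
    vec = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class SemanticCache:
    """Cache keyed on embedding similarity rather than exact string match."""
    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, query):
        q = embed(query)
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:  # linear scan; real systems use an ANN index
            sim = sum(a * b for a, b in zip(q, emb))
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_resp if best_sim >= self.threshold else None  # None = cache miss

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.7)
cache.put("What is the capital of France?", "Paris")
hit = cache.get("what is the capital of france")    # near-duplicate phrasing
miss = cache.get("How do I bake sourdough bread?")  # different intent
```

On a miss, the caller would invoke the LLM and `put` the new query–response pair; the precision/recall balance discussed above is governed entirely by the embedding quality and the similarity threshold.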

📝 Abstract
This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in balancing precision, query latency, and computational efficiency. We propose leveraging smaller, domain-specific embedding models, fine-tuned with targeted real-world and synthetically generated datasets. Our empirical evaluations demonstrate that compact embedding models fine-tuned for just one epoch on specialized datasets significantly surpass both state-of-the-art open-source and proprietary alternatives in precision and recall. Moreover, we introduce a novel synthetic data generation pipeline for the semantic cache that mitigates the challenge of limited domain-specific annotated data, further boosting embedding performance. Our approach effectively balances computational overhead and accuracy, establishing a viable and efficient strategy for practical semantic caching implementations.
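The abstract does not detail the synthetic data pipeline, but its likely shape can be sketched: generate paraphrase positives (queries that should hit the same cache entry) and negatives (queries with different intent) as labeled pairs for contrastive fine-tuning of the embedding model. Everything below — the template-based `paraphrase` stand-in for an LLM generator, the negative-sampling scheme, and the triple format — is an illustrative assumption, not the paper's actual pipeline.

```python
import random

def paraphrase(query, rng):
    # Stand-in for an LLM paraphrase call; the paper's pipeline would
    # presumably prompt a generator model here (hypothetical templates).
    templates = ["please tell me: {q}", "{q} -- can you help?", "quick question: {q}"]
    return rng.choice(templates).format(q=query)

def build_training_triples(seed_queries, rng=None):
    """Emit (anchor, candidate, label) triples for contrastive fine-tuning:
    label 1 = semantic duplicate (should be a cache hit),
    label 0 = negative (a different query, should be a miss)."""
    rng = rng or random.Random(0)
    triples = []
    for i, q in enumerate(seed_queries):
        triples.append((q, paraphrase(q, rng), 1))       # positive pair
        neg = seed_queries[(i + 1) % len(seed_queries)]  # another seed as negative
        triples.append((q, neg, 0))
    return triples

seeds = ["reset my account password", "cancel my subscription", "update billing address"]
triples = build_training_triples(seeds)
```

Such triples can feed a standard contrastive or ranking loss; the abstract's claim that one epoch suffices suggests the labeled pairs are informative enough for compact models to adapt quickly.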
Problem

Research questions and friction points this paper is trying to address.

Enhancing semantic caching effectiveness with domain-specific embeddings
Balancing precision, query latency, and computational efficiency in caching
Overcoming limited annotated data via synthetic data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-specific fine-tuned embedding models
Synthetic data generation pipeline
Balanced computational overhead and accuracy