🤖 AI Summary
To address the high computational overhead, memory consumption, and bandwidth usage incurred when large language models (LLMs) process long contexts in edge–cloud collaborative settings, this paper proposes a semantic caching mechanism tailored for retrieval-augmented generation (RAG). The core innovation is to treat intermediate-context semantic summaries as cacheable, matchable representations, the first such use of semantic summarization for caching. Using semantic similarity hashing and summary-embedding matching, the method builds an efficient cache index that enables context reuse across similar queries while preserving answer accuracy comparable to full-document processing. Experiments on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset show a 50–60% reduction in redundant computation, validating both effectiveness and practicality.
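The cache described above can be sketched as an embedding-similarity lookup: a query is embedded, compared against the embeddings of previously stored context summaries, and a cached summary is reused on a sufficiently close match instead of re-processing the full retrieved documents. This is a minimal illustrative sketch, not the paper's implementation; the toy bag-of-words embedding, the `SemanticSummaryCache` class, and the 0.8 threshold are assumptions (a real system would use a learned sentence-embedding model and an approximate-nearest-neighbor index).

```python
import math
import zlib
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str, dim: int = 64) -> List[float]:
    """Toy bag-of-words embedding: hash each token into a fixed-size,
    L2-normalized vector. Stands in for a sentence-embedding model."""
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(token.encode()) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two unit vectors is just their dot product."""
    return sum(x * y for x, y in zip(a, b))


class SemanticSummaryCache:
    """Stores (query embedding, context summary) pairs; a lookup returns
    a cached summary when a prior query is similar enough, so the full
    retrieved documents need not be summarized again."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def store(self, query: str, summary: str) -> None:
        self.entries.append((embed(query), summary))

    def lookup(self, query: str) -> Optional[str]:
        q = embed(query)
        best_sim, best_summary = 0.0, None
        for emb, summary in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_sim, best_summary = sim, summary
        # Cache hit only above the similarity threshold.
        return best_summary if best_sim >= self.threshold else None


cache = SemanticSummaryCache(threshold=0.8)
cache.store("who wrote the iliad", "Summary: ancient Greek epic attributed to Homer.")
print(cache.lookup("who wrote the iliad poem"))  # near-duplicate query: cache hit
print(cache.lookup("capital of france"))         # unrelated query: cache miss
```

A production variant would replace the linear scan with a vector index and the threshold with one tuned against the accuracy/reuse trade-off the paper reports.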
📝 Abstract
Large Language Models (LLMs) are increasingly deployed across edge and cloud platforms for real-time question answering and retrieval-augmented generation. However, processing lengthy contexts in distributed systems incurs high computational overhead, memory usage, and network bandwidth consumption. This paper introduces a novel semantic caching approach that stores and reuses intermediate contextual summaries, enabling efficient information reuse across similar queries in LLM-based QA workflows. Our method reduces redundant computation by 50–60% while maintaining answer accuracy comparable to full-document processing, as demonstrated on NaturalQuestions, TriviaQA, and a synthetic ArXiv dataset. This approach balances computational cost and response quality, a trade-off critical for real-time AI assistants.