TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor alignment between cached responses and user-specific requirements, as well as the limited precision of semantic similarity retrieval for high-frequency LLM queries, this paper proposes TweakLLM, a dynamic cache routing architecture built around a lightweight fine-tuned LLM. Its core innovation is a learnable routing controller, implemented as a compact fine-tuned LLM, that performs real-time semantic adaptation and personalized rewriting of candidate cached responses rather than relying on binary cache hit/miss decisions or direct forwarding. The method combines semantic retrieval, multi-agent debate-based quality assessment, and side-by-side user comparison studies. Experimental results demonstrate that, while preserving response quality comparable to state-of-the-art models, TweakLLM improves the effective cache hit rate by 37.2% and reduces average end-to-end latency by 41.5%, significantly enhancing resource efficiency and response relevance under high-concurrency workloads.
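
The flow described above has three moving parts: a semantic cache lookup, a lightweight "tweaker" model that adapts a cached answer to the incoming prompt, and a frontier-model fallback on a miss. The sketch below is a minimal rendering of that flow in Python; the function names, the rewrite prompt, and the callable interfaces (`embed`, `small_llm`, `frontier_llm`, `cache`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the routing flow (illustrative, not the paper's code).
# Assumes a cache object exposing lookup()/insert() and three callables
# supplied by the deployment:
#   embed(prompt)        -> embedding vector for the prompt
#   small_llm(prompt)    -> completion from the lightweight "tweaker" model
#   frontier_llm(prompt) -> completion from the frontier model

def answer(prompt, cache, embed, small_llm, frontier_llm):
    """Serve a prompt by tweaking a cached response when a semantically
    similar one exists, otherwise fall back to the frontier model."""
    emb = embed(prompt)
    hit = cache.lookup(emb)  # (cached_prompt, cached_response) or None

    if hit is not None:
        cached_prompt, cached_response = hit
        # Cache hit: rather than returning the stored answer verbatim, the
        # lightweight model rewrites it so it addresses the new prompt.
        rewrite_prompt = (
            "Rewrite the answer so it directly addresses the new question.\n"
            f"Original question: {cached_prompt}\n"
            f"Original answer: {cached_response}\n"
            f"New question: {prompt}"
        )
        return small_llm(rewrite_prompt)

    # Cache miss: query the frontier model and store its answer for reuse.
    response = frontier_llm(prompt)
    cache.insert(emb, prompt, response)
    return response
```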

📝 Abstract
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.
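
The abstract attributes much of the difficulty to the limited accuracy of semantic similarity search over cached prompts. A retrieval layer of that kind typically stores prompt embeddings and returns a cached entry only when cosine similarity clears a threshold; the toy class below sketches this under assumed names (`SemanticCache`, a 0.85 default threshold, a flat in-memory list), which are illustrative rather than taken from the paper.

```python
import math
from dataclasses import dataclass, field

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

@dataclass
class SemanticCache:
    """Toy in-memory semantic cache holding (embedding, prompt, response) triples."""
    threshold: float = 0.85                      # illustrative similarity cutoff
    entries: list = field(default_factory=list)

    def lookup(self, embedding):
        """Return the most similar (prompt, response) above the threshold, else None."""
        best, best_sim = None, self.threshold
        for emb, prompt, response in self.entries:
            sim = cosine_similarity(embedding, emb)
            if sim >= best_sim:
                best, best_sim = (prompt, response), sim
        return best

    def insert(self, embedding, prompt, response):
        self.entries.append((embedding, prompt, response))
```

Because entries that clear such a threshold can still differ from the new query in intent or detail, a raw cache hit may not match what the user asked; that residual gap is what the tweaking step is meant to close.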
Problem

Research questions and friction points this paper is trying to address.

Dynamic adaptation of cached responses to user queries
Improving cache effectiveness without losing response quality
Scalable caching solution for high-volume LLM deployments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight LLM dynamically adapts cached responses (see the usage sketch after this list)
Routing architecture improves cache effectiveness
Scalable solution maintains response quality
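
To show how the two sketches above fit together end to end, the snippet below wires the toy `SemanticCache` into `answer()` with stub callables standing in for the embedding, lightweight, and frontier models. The stub embedding (a character-frequency vector) and the lowered threshold are purely illustrative.

```python
# End-to-end demo using the SemanticCache and answer() sketches above.
# All stubs are illustrative; a real deployment would use a sentence-embedding
# model and actual chat-completion endpoints.

def embed(prompt):
    # Crude stand-in for an embedding model: 26-dim character-frequency vector.
    vec = [0.0] * 26
    for ch in prompt.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def small_llm(prompt):
    return f"[lightweight model rewrite of]\n{prompt}"

def frontier_llm(prompt):
    return f"[frontier model answer to] {prompt}"

cache = SemanticCache(threshold=0.8)  # loosened to suit the crude stub embedding

# First query misses the cache and is answered by the frontier model.
print(answer("How do I reverse a list in Python?", cache, embed, small_llm, frontier_llm))

# A paraphrase of the same question hits the cache, so the lightweight model
# adapts the stored answer instead of calling the frontier model again.
print(answer("What is the way to reverse a Python list?", cache, embed, small_llm, frontier_llm))
```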
Authors
Muhammad Taha Cheema
Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan
Abeer Aamir
Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan
Khawaja Gul Muhammad
Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan
Naveed Anwar Bhatti
Assistant Professor, LUMS, Lahore
Cyber Physical Systems, Internet of Things, Embedded Systems, Wireless Sensor Networks, Security
Ihsan Ayyub Qazi
Full Professor of Computer Science, LUMS
Networked Systems (Digital Development, Misinformation, GenAI, Digital Health)
Zafar Ayyub Qazi
Associate Professor, LUMS
Networked Systems