TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor alignment between cached responses and user-specific requirements, as well as the limited precision of semantic similarity retrieval for high-frequency LLM queries, this paper proposes TweakLLM, a dynamic cache routing architecture built around a lightweight fine-tuned LLM. Its core innovation is a learnable routing controller, implemented as a compact fine-tuned LLM, that performs real-time semantic adaptation and personalized rewriting of candidate cached responses rather than relying on binary cache hit/miss decisions or direct forwarding. The method combines semantic retrieval, multi-agent debate-based quality assessment, and side-by-side user comparison studies. Experimental results demonstrate that, while preserving response quality comparable to state-of-the-art models, TweakLLM improves the effective cache hit rate by 37.2% and reduces average end-to-end latency by 41.5%, significantly enhancing resource efficiency and response relevance under high-concurrency workloads.
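
The flow described above has three moving parts: a semantic cache lookup, a lightweight "tweaker" model that adapts a cached answer to the incoming prompt, and a frontier-model fallback on a miss. The sketch below is a minimal rendering of that flow in Python; the function names, the rewrite prompt, and the callable interfaces (`embed`, `small_llm`, `frontier_llm`, `cache`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the routing flow (illustrative, not the paper's code).
# Assumes a cache object exposing lookup()/insert() and three callables
# supplied by the deployment:
#   embed(prompt)        -> embedding vector for the prompt
#   small_llm(prompt)    -> completion from the lightweight "tweaker" model
#   frontier_llm(prompt) -> completion from the frontier model

def answer(prompt, cache, embed, small_llm, frontier_llm):
    """Serve a prompt by tweaking a cached response when a semantically
    similar one exists, otherwise fall back to the frontier model."""
    emb = embed(prompt)
    hit = cache.lookup(emb)  # (cached_prompt, cached_response) or None

    if hit is not None:
        cached_prompt, cached_response = hit
        # Cache hit: rather than returning the stored answer verbatim, the
        # lightweight model rewrites it so it addresses the new prompt.
        rewrite_prompt = (
            "Rewrite the answer so it directly addresses the new question.\n"
            f"Original question: {cached_prompt}\n"
            f"Original answer: {cached_response}\n"
            f"New question: {prompt}"
        )
        return small_llm(rewrite_prompt)

    # Cache miss: query the frontier model and store its answer for reuse.
    response = frontier_llm(prompt)
    cache.insert(emb, prompt, response)
    return response
```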

📝 Abstract
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.
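
The abstract attributes much of the difficulty to the limited accuracy of semantic similarity search over cached prompts. A retrieval layer of that kind typically stores prompt embeddings and returns a cached entry only when cosine similarity clears a threshold; the toy class below sketches this under assumed names (`SemanticCache`, a 0.85 default threshold, a flat in-memory list), which are illustrative rather than taken from the paper.

```python
import math
from dataclasses import dataclass, field

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

@dataclass
class SemanticCache:
    """Toy in-memory semantic cache holding (embedding, prompt, response) triples."""
    threshold: float = 0.85                      # illustrative similarity cutoff
    entries: list = field(default_factory=list)

    def lookup(self, embedding):
        """Return the most similar (prompt, response) above the threshold, else None."""
        best, best_sim = None, self.threshold
        for emb, prompt, response in self.entries:
            sim = cosine_similarity(embedding, emb)
            if sim >= best_sim:
                best, best_sim = (prompt, response), sim
        return best

    def insert(self, embedding, prompt, response):
        self.entries.append((embedding, prompt, response))
```

Because entries that clear such a threshold can still differ from the new query in intent or detail, a raw cache hit may not match what the user asked; that residual gap is what the tweaking step is meant to close.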
Problem

Research questions and friction points this paper is trying to address.

Dynamic adaptation of cached responses to user queries
Improving cache effectiveness without losing response quality
Scalable caching solution for high-volume LLM deployments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight LLM dynamically adapts cached responses (see the usage sketch after this list)
Routing architecture improves cache effectiveness
Scalable solution maintains response quality
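
To show how the two sketches above fit together end to end, the snippet below wires the toy `SemanticCache` into `answer()` with stub callables standing in for the embedding, lightweight, and frontier models. The stub embedding (a character-frequency vector) and the lowered threshold are purely illustrative.

```python
# End-to-end demo using the SemanticCache and answer() sketches above.
# All stubs are illustrative; a real deployment would use a sentence-embedding
# model and actual chat-completion endpoints.

def embed(prompt):
    # Crude stand-in for an embedding model: 26-dim character-frequency vector.
    vec = [0.0] * 26
    for ch in prompt.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def small_llm(prompt):
    return f"[lightweight model rewrite of]\n{prompt}"

def frontier_llm(prompt):
    return f"[frontier model answer to] {prompt}"

cache = SemanticCache(threshold=0.8)  # loosened to suit the crude stub embedding

# First query misses the cache and is answered by the frontier model.
print(answer("How do I reverse a list in Python?", cache, embed, small_llm, frontier_llm))

# A paraphrase of the same question hits the cache, so the lightweight model
# adapts the stored answer instead of calling the frontier model again.
print(answer("What is the way to reverse a Python list?", cache, embed, small_llm, frontier_llm))
```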
Authors
Muhammad Taha Cheema
Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan
Abeer Aamir
Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan
Khawaja Gul Muhammad
Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan
Naveed Anwar Bhatti
Assistant Professor, LUMS, Lahore
Cyber Physical Systems, Internet of Things, Embedded Systems, Wireless Sensor Networks, Security
Ihsan Ayyub Qazi
Full Professor of Computer Science, LUMS
Networked Systems (Digital Development, Misinformation, GenAI, Digital Health)
Zafar Ayyub Qazi
Associate Professor, LUMS
Networked Systems