Doing More with Less -- Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM) systems commonly suffer from substantial resource waste due to static or suboptimal deployment strategies. Method: This paper proposes a cost-quality-aware query routing mechanism that dynamically dispatches user queries to the most suitable lightweight model, domain-specific expert, or embedding strategy. We formally define the routing problem for the first time and introduce a novel taxonomy that jointly considers relevance and resource efficiency. Our framework systematically compares academic approaches with industrial practices, incorporating query understanding, policy selection, multi-granularity routing (at both the model and embedding levels), fine-grained cost modeling, and a unified evaluation protocol. Contribution/Results: Experiments demonstrate that our mechanism significantly reduces inference overhead (by up to 42% in latency and 38% in GPU memory) while maintaining or even improving answer quality across diverse benchmarks. This work establishes both a theoretical foundation and a reproducible, practical paradigm for building efficient, scalable, and cost-effective LLM systems.
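The routing objective described in the summary can be sketched compactly. The notation below (candidate set $\mathcal{M}$, cost function $c$, quality estimate $s$, quality threshold $\tau$) is illustrative and not taken from the paper itself:

```latex
m^{*}(q) \;=\; \arg\min_{m \in \mathcal{M}} \; c(m, q)
\quad \text{subject to} \quad s(m, q) \,\ge\, \tau
```

That is, for each query $q$ the router selects the cheapest candidate component $m$ whose estimated answer quality still clears a minimum bar; raising $\tau$ trades cost savings for quality.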

📝 Abstract
Large Language Model (LLM)-based systems, i.e., systems of interconnected elements that include an LLM as a central component (e.g., conversational agents), are typically monolithic, static architectures that rely on a single LLM for all user queries. However, user queries often require different preprocessing strategies, levels of reasoning, or knowledge. Generalist LLMs (e.g., GPT-4), trained on very large multi-topic corpora, can perform well on a variety of tasks. However, they require significant financial, energy, and hardware resources that may not be justified for basic tasks, which means potentially incurring unnecessary costs for a given query. To overcome this problem, a routing mechanism routes user queries to the most suitable components, such as smaller LLMs or experts in specific topics. This approach may improve response quality while minimising costs. Routing can be extended to other components of the conversational agent architecture, such as the selection of optimal embedding strategies. This paper explores key considerations for integrating routing into LLM-based systems, focusing on resource management, cost definition, and strategy selection. Our main contributions include a formalisation of the problem, a novel taxonomy of existing approaches emphasising relevance and resource efficiency, and a comparative analysis of these strategies in relation to industry practices. Finally, we identify critical challenges and directions for future research.
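The routing idea in the abstract (dispatch cheap queries to small models, specialist queries to experts) can be sketched as a simple scored dispatch. This is a minimal illustration under assumed names and costs, not the paper's actual method: the model names, cost units, and the topic-overlap relevance heuristic are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str        # hypothetical model identifier
    cost: float      # relative cost per query (illustrative units)
    strengths: set   # topics the model is assumed to handle well

def route(query_topics: set, candidates: list, lam: float = 0.5) -> str:
    """Pick the candidate maximising estimated relevance minus lam * cost."""
    def score(c: Candidate) -> float:
        # Relevance proxy: fraction of the query's topics the model covers.
        relevance = len(query_topics & c.strengths) / max(len(query_topics), 1)
        return relevance - lam * c.cost
    return max(candidates, key=score).name

candidates = [
    Candidate("small-generalist", cost=0.1, strengths={"chitchat", "faq"}),
    Candidate("medical-expert", cost=0.4, strengths={"medical"}),
    Candidate("large-generalist", cost=1.0,
              strengths={"chitchat", "faq", "medical", "legal"}),
]

print(route({"faq"}, candidates))      # a basic query goes to the cheap model
print(route({"medical"}, candidates))  # a specialist query goes to the expert
```

The trade-off parameter `lam` plays the role of a cost penalty: at `lam = 0` the router always picks the most relevant (typically largest) model, and as `lam` grows it increasingly favours cheaper components, which mirrors the cost-quality tension the survey formalises.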
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Task Allocation
Efficiency Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized Resource Allocation
Task Assignment Strategies
Enhanced Understanding in Chatbots
C. Varangot-Reille
Wikit, Lyon, France; Laboratoire Hubert Curien, UMR CNRS 5516, Saint-Etienne, France
Christophe Bouvard
Wikit, Lyon, France
Antoine Gourru
Associate professor, University Jean Monnet of Saint-Etienne (France)
Machine Learning · Natural Language Processing · Fairness
Mathieu Ciancone
Wikit, Lyon, France
Marion Schaeffer
Wikit, Lyon, France
François Jacquenet
Laboratoire Hubert Curien, UMR CNRS 5516, Saint-Etienne, France