Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the inefficiency and limited interpretability of existing methods for choosing between retrieval-augmented generation (RAG) and long-context (LC) processing, which typically fall back from RAG to LC only after a failure. The paper proposes Pre-Route, a framework that reasons over lightweight metadata (e.g., document type, length, initial snippet) before generation to predict task requirements and select a strategy. Structured prompts elicit the routing ability already latent in large language models, sharpen the separability of the optimal routing dimension in representation space (as measured by linear probes), and allow the reasoning structure to be distilled into smaller models. On LaRA (in-domain) and LongBench-v2 (out-of-distribution), Pre-Route outperforms Always-RAG, Always-LC, and Self-Route, with single-sample performance approaching Best-of-N and better overall cost-effectiveness.
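
As a rough illustration of routing over lightweight metadata, the sketch below builds a structured routing prompt and parses the decision from a model reply. The names `DocMetadata`, `build_routing_prompt`, and `parse_route`, the prompt wording, the `ROUTE:` output convention, and the default of falling back to LC are all assumptions made for illustration; they are not the paper's actual prompt or interface.

```python
from dataclasses import dataclass
import re


@dataclass
class DocMetadata:
    """Lightweight metadata available before reading the full document."""
    doc_type: str
    num_tokens: int
    snippet: str


def build_routing_prompt(question: str, meta: DocMetadata) -> str:
    """Assemble a structured prompt (hypothetical wording) for the router."""
    return (
        "Decide how to answer the question before reading the full document.\n"
        f"Document type: {meta.doc_type}\n"
        f"Document length (tokens): {meta.num_tokens}\n"
        f"Initial snippet: {meta.snippet}\n"
        f"Question: {question}\n"
        "Reason step by step:\n"
        "1. Task analysis: what kind of task is this?\n"
        "2. Coverage estimation: would a few retrieved chunks cover the answer?\n"
        "3. Information need: is local or global context required?\n"
        "Finish with one line: ROUTE: RAG or ROUTE: LC\n"
    )


def parse_route(llm_output: str, default: str = "LC") -> str:
    """Return the last ROUTE decision in the reply, or the assumed default."""
    matches = re.findall(r"ROUTE:\s*(RAG|LC)\b", llm_output, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else default
```

The prompt string would be sent to whichever LLM acts as the router; `parse_route` takes the last match so that intermediate reasoning mentioning both options does not override the final line.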

📝 Abstract

Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the "optimal routing dimension" in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.
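
The contrast between failure-driven fallback and proactive routing can be sketched as control flow. This is a minimal sketch under stated assumptions: the `UNANSWERABLE` sentinel, the function names, and the call logs are hypothetical, and the answering and routing functions are passed in as stubs rather than real model calls. It does not reproduce the internals of Self-Route or Pre-Route.

```python
from typing import Callable, List, Tuple

Answerer = Callable[[str], str]

UNANSWERABLE = "unanswerable"  # assumed sentinel for a failed RAG attempt


def failure_driven_fallback(
    question: str, rag_answer: Answerer, lc_answer: Answerer
) -> Tuple[str, List[str]]:
    """Try RAG first; pay for LC only after the RAG attempt fails."""
    calls = ["rag"]
    answer = rag_answer(question)
    if answer.strip().lower() == UNANSWERABLE:
        calls.append("lc")
        answer = lc_answer(question)
    return answer, calls


def proactive_route(
    question: str, route: Callable[[str], str],
    rag_answer: Answerer, lc_answer: Answerer,
) -> Tuple[str, List[str]]:
    """Pick a strategy up front, then run exactly one answering call."""
    choice = route(question)
    if choice == "RAG":
        return rag_answer(question), ["route", "rag"]
    return lc_answer(question), ["route", "lc"]
```

In the fallback flow, a question that needs global context costs a wasted RAG call plus an LC call; in the proactive flow it costs one routing step over metadata plus a single LC call.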

Problem

Research questions and friction points this paper is trying to address.

RAG

Long-Context

Routing

LLMs

Cost-effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-Route

retrieval-augmented generation

long-context reasoning