🤖 AI Summary
Existing selective-retrieval methods for RAG neglect LLMs' intrinsic knowledge, leading to rigid knowledge-source selection and low-quality retrievals that interfere with answer generation. To address this, we propose *Self-Routing RAG (SR-RAG)*, a mechanism that lets the model autonomously decide between external retrieval and its own internal parametric knowledge, jointly optimizing knowledge-source selection and natural, verbalized response generation. We further introduce dynamic nearest-neighbor knowledge-source inference to mitigate domain shift. Through multi-task fine-tuning that jointly models knowledge-source selection, knowledge verbalization, and response generation, combined with nearest-neighbor inference, our method improves average response accuracy by 5.1% across three mainstream LLMs, reduces inference latency, and cuts redundant retrieval calls by 29%. Our core contribution is integrating the LLM's confidence in its own knowledge into routing decisions, enabling adaptive retrieval augmentation.
📝 Abstract
Selective retrieval improves retrieval-augmented generation (RAG) by reducing distractions from low-quality retrievals and improving efficiency. However, existing approaches under-utilize the inherent knowledge of large language models (LLMs), leading to suboptimal retrieval decisions and degraded generation performance. To bridge this gap, we propose Self-Routing RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge verbalization. SR-RAG enables an LLM to dynamically decide between external retrieval and verbalizing its own parametric knowledge. To this end, we design a multi-task objective that jointly optimizes an LLM on knowledge source selection, knowledge verbalization, and response generation. We further introduce dynamic knowledge source inference via nearest neighbor search to improve the accuracy of knowledge source decisions under domain shifts. Fine-tuning three LLMs with SR-RAG significantly improves both their response accuracy and inference latency. Compared to the strongest selective retrieval baseline, SR-RAG reduces retrievals by 29% while improving performance by 5.1%.
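To make the routing step concrete, here is a minimal sketch of what "dynamic knowledge source inference via nearest neighbor search" could look like. This is an illustrative assumption, not the paper's implementation: function names, the two-source labels (`"retrieve"` vs. `"verbalize"`), the cosine-similarity metric, and the majority vote are all hypothetical choices, and the embeddings here are toy 2-D vectors standing in for real query representations.

```python
import numpy as np

def decide_knowledge_source(query_emb, neighbor_embs, neighbor_labels, k=5):
    """Pick a knowledge source by majority vote among the k nearest
    labeled training queries in embedding space (cosine similarity).

    Hypothetical sketch: each neighbor is a past query whose best source
    ("retrieve" = use external retrieval, "verbalize" = rely on the
    model's parametric knowledge) is already known.
    """
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    n = neighbor_embs / np.linalg.norm(neighbor_embs, axis=1, keepdims=True)
    sims = n @ q
    top_k = np.argsort(sims)[-k:]          # indices of the k most similar queries
    votes = [neighbor_labels[i] for i in top_k]
    # Route to the source that the most similar past queries needed.
    return max(set(votes), key=votes.count)

# Toy data: queries clustered near (1, 0) needed retrieval,
# queries near (0, 1) were answerable from parametric knowledge.
rng = np.random.default_rng(0)
embs = np.vstack([rng.normal([1.0, 0.0], 0.1, (10, 2)),
                  rng.normal([0.0, 1.0], 0.1, (10, 2))])
labels = ["retrieve"] * 10 + ["verbalize"] * 10

print(decide_knowledge_source(np.array([0.9, 0.1]), embs, labels))  # → retrieve
```

Because the neighbor index can be rebuilt from whatever labeled queries are available at deployment time, a lookup like this adapts the routing decision to a new domain without retraining the underlying model, which is the intuition behind handling domain shift this way.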