HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the growth in retrieval latency that retrieval-augmented generation (RAG) systems incur as their knowledge bases scale. Existing acceleration methods either trade accuracy for speed through approximate retrieval or reuse results only for strictly identical queries, which yields marginal gains. The authors propose HaS, a framework that models query homology (the relationship among semantically related queries) and reframes retrieval acceleration as a homologous query re-identification task. HaS runs a fast speculative retrieval over a restricted scope to obtain candidate documents, then validates them through homology-aware verification; when a previously observed query is identified as homologous to the incoming one, the candidates are accepted and the slow full-database retrieval is skipped. HaS is plug-and-play with existing RAG pipelines and reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1–2% accuracy drop, and it also accelerates complex multi-hop queries in agentic RAG pipelines.
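A rough way to see where the savings come from is a speculate-then-verify cost model. This is a back-of-the-envelope sketch, not a formula taken from the paper: the symbols $p$ (fraction of queries whose draft is accepted), $T_{\text{spec}}$ (speculative retrieval time), $T_{\text{ver}}$ (verification time), and $T_{\text{full}}$ (full-database retrieval time) are all assumptions introduced here for illustration.

```latex
\mathbb{E}[T_{\text{HaS}}] = T_{\text{spec}} + T_{\text{ver}} + (1 - p)\, T_{\text{full}},
\qquad
1 - \frac{\mathbb{E}[T_{\text{HaS}}]}{T_{\text{full}}}
  = p - \frac{T_{\text{spec}} + T_{\text{ver}}}{T_{\text{full}}}
```

Under this model the relative latency reduction is positive only when the acceptance rate $p$ exceeds the overhead ratio $(T_{\text{spec}} + T_{\text{ver}}) / T_{\text{full}}$, which is why the approach depends on homologous queries being common.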

📝 Abstract

Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference time by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as knowledge databases grow. Existing acceleration strategies either compromise accuracy through approximate retrieval or achieve only marginal gains by reusing the results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, then validates whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Because homologous queries are prevalent under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1–2% accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
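The speculate-then-verify control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `SpeculativeRetriever`, the `full_retrieve` callback, and the `threshold` parameter are hypothetical names, and a plain cosine-similarity threshold stands in for the paper's homologous query re-identification step, whose actual form is not given here. Query embeddings are assumed to be computed elsewhere.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class SpeculativeRetriever:
    # Hypothetical slow path: a full-database search returning document ids.
    full_retrieve: Callable[[Vector], List[str]]
    # Assumed stand-in for homology verification: accept the draft when the
    # most similar past query is at least this close to the incoming query.
    threshold: float = 0.9
    # Previously observed queries and the documents retrieved for them.
    history: List[Tuple[Vector, List[str]]] = field(default_factory=list)

    def retrieve(self, query_vec: Vector) -> Tuple[List[str], bool]:
        """Return (documents, accepted), where accepted means the draft was used."""
        # Speculative step: search only the restricted scope of past queries.
        best_sim, draft = -1.0, None
        for past_vec, past_docs in self.history:
            sim = cosine(query_vec, past_vec)
            if sim > best_sim:
                best_sim, draft = sim, past_docs

        # Verification step: accept the draft only for a close enough match.
        if draft is not None and best_sim >= self.threshold:
            return draft, True

        # Fallback: run the slow full retrieval and remember the result.
        docs = self.full_retrieve(query_vec)
        self.history.append((tuple(query_vec), docs))
        return docs, False


# Usage with a fake slow search that records how often it is called.
slow_calls: List[Vector] = []


def slow_search(vec: Vector) -> List[str]:
    slow_calls.append(vec)
    return ["doc-a", "doc-b"]


retriever = SpeculativeRetriever(full_retrieve=slow_search, threshold=0.9)
first = retriever.retrieve([1.0, 0.0])     # no history yet: full retrieval
second = retriever.retrieve([0.99, 0.05])  # near-duplicate: draft accepted
third = retriever.retrieve([0.0, 1.0])     # unrelated query: full retrieval
```

In this sketch only the first and third queries reach the slow path; the second reuses the cached documents. A lower `threshold` accepts more drafts and saves more latency, at the risk of returning documents that do not match the incoming query, which mirrors the latency/accuracy trade-off reported in the abstract.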

Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation

retrieval latency

homologous queries

knowledge retrieval

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

homology-aware retrieval

speculative retrieval

RAG acceleration