🤖 AI Summary
This work addresses a limitation of existing lightweight GraphRAG approaches: because they rely solely on entity co-occurrence structure, they struggle to capture latent semantic relationships between disconnected entities, which constrains multi-hop reasoning. To overcome this, the authors propose EHRAG, a dual-path hypergraph construction method that combines sentence-level co-occurrence hyperedges with semantic hyperedges derived from clustering entity embeddings. They further introduce a hybrid diffusion retrieval mechanism that combines topic-aware scoring with Personalized PageRank (PPR). The approach bridges structural and semantic information while maintaining linear indexing complexity and zero token consumption for graph construction. Experiments on four benchmark datasets show that the method outperforms state-of-the-art baselines on multi-hop reasoning tasks.
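To make the dual-path construction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy vocabulary lookup in place of a real NER model, hand-written 2-D vectors in place of text embeddings, and a greedy cosine-threshold grouping in place of whatever clustering algorithm EHRAG actually uses. All function names, the `threshold` parameter, and the sample data are hypothetical.

```python
import math
import re

def extract_entities(sentence, vocabulary):
    """Toy stand-in for lightweight NER: keep vocabulary terms found in the sentence."""
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return {entity for entity in vocabulary if entity in tokens}

def structural_hyperedges(sentences, vocabulary):
    """One hyperedge per sentence that mentions at least two entities."""
    edges = []
    for sentence in sentences:
        entities = extract_entities(sentence, vocabulary)
        if len(entities) >= 2:
            edges.append(frozenset(entities))
    return edges

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def semantic_hyperedges(embeddings, threshold=0.9):
    """Assumed clustering: each entity joins the first cluster whose seed
    vector is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, member_set)
    for entity in sorted(embeddings):
        vector = embeddings[entity]
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.add(entity)
                break
        else:
            clusters.append((vector, {entity}))
    # Singleton clusters connect nothing, so they are not emitted as hyperedges.
    return [frozenset(members) for _, members in clusters if len(members) >= 2]

def build_hypergraph(sentences, vocabulary, embeddings, threshold=0.9):
    """Return (structural, semantic) hyperedge lists over the same entity set."""
    return (structural_hyperedges(sentences, vocabulary),
            semantic_hyperedges(embeddings, threshold))

# Hypothetical sample data.
SENTENCES = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
VOCABULARY = {"paris", "france", "berlin", "germany"}
EMBEDDINGS = {
    "paris": (1.0, 0.1), "berlin": (0.9, 0.2),    # "city-like" direction
    "france": (0.1, 1.0), "germany": (0.2, 0.9),  # "country-like" direction
}

structural, semantic = build_hypergraph(SENTENCES, VOCABULARY, EMBEDDINGS)
print("structural:", [sorted(edge) for edge in structural])
print("semantic:  ", [sorted(edge) for edge in semantic])
```

In this toy corpus, "paris" and "berlin" never share a sentence, so the structural path alone leaves them disconnected; the semantic path links them because their vectors point in a similar direction. No LLM call is involved in either path, which is the sense in which construction can have zero token cost.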
📝 Abstract
Graph-based Retrieval-Augmented Generation (GraphRAG) enhances LLMs by structuring a corpus into graphs to facilitate multi-hop reasoning. While recent lightweight approaches reduce indexing costs by leveraging Named Entity Recognition (NER), they rely strictly on structural co-occurrence and fail to capture latent semantic connections between disjoint entities. To address this, we propose EHRAG, a lightweight RAG framework that constructs a hypergraph capturing both structural and semantic relationships and employs a hybrid structural-semantic retrieval mechanism. Specifically, EHRAG constructs structural hyperedges from sentence-level co-occurrence using lightweight entity extraction, and semantic hyperedges by clustering entity text embeddings, so that the hypergraph encompasses both structural and semantic information. For retrieval, EHRAG performs a structural-semantic hybrid diffusion with topic-aware scoring and Personalized PageRank (PPR) refinement to identify the top-k relevant documents. Experiments on four datasets show that EHRAG outperforms state-of-the-art baselines while maintaining linear indexing complexity and zero token consumption for construction. Code is available at https://github.com/yfsong00/EHRAG.
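The retrieval step can be illustrated with a small Personalized PageRank sketch over a hypergraph. This is an illustration under stated assumptions rather than EHRAG's actual procedure: the random walk (pick an incident hyperedge uniformly, then a member of it uniformly), the restart probability `alpha`, and the sum-of-entity-scores document ranking are all choices made here for simplicity, and the topic-aware scoring stage is not reproduced. Function names and sample data are hypothetical.

```python
def personalized_pagerank(hyperedges, seeds, alpha=0.15, iterations=50):
    """Power iteration for PPR on a hypergraph.

    Assumed walk: from an entity, choose one of its hyperedges uniformly,
    then move to a uniformly chosen member of that hyperedge. With
    probability alpha the walk restarts at a seed (query) entity.
    """
    nodes = sorted(set().union(*hyperedges))
    seeds = [s for s in seeds if s in nodes]
    if not seeds:
        raise ValueError("no seed entity appears in the hypergraph")
    incident = {n: [e for e in hyperedges if n in e] for n in nodes}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            mass = (1.0 - alpha) * score[n]
            for edge in incident[n]:
                # Split the node's mass evenly over its hyperedges, then over members.
                share = mass / (len(incident[n]) * len(edge))
                for member in edge:
                    nxt[member] += share
        score = nxt
    return score

def rank_documents(doc_entities, entity_scores, k=2):
    """Assumed scoring: a document's score is the sum of its entities' PPR scores."""
    scored = [(doc, sum(entity_scores.get(e, 0.0) for e in entities))
              for doc, entities in doc_entities.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical hypergraph: two structural and two semantic hyperedges.
HYPEREDGES = [
    frozenset({"paris", "france"}), frozenset({"berlin", "germany"}),   # structural
    frozenset({"paris", "berlin"}), frozenset({"france", "germany"}),   # semantic
]
DOC_ENTITIES = {"d1": {"paris", "france"}, "d2": {"berlin", "germany"}}

scores = personalized_pagerank(HYPEREDGES, seeds={"paris"})
ranked = rank_documents(DOC_ENTITIES, scores, k=2)
print(ranked)
```

With the query seeded at "paris", probability mass reaches "berlin" only through the semantic hyperedge, so the second document receives a nonzero score even though it shares no sentence with the query entity. That is the kind of bridge between disjoint entities the abstract describes.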