HuixiangDou2: A Robustly Optimized GraphRAG Approach

📅 2025-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) retrieve knowledge poorly in specialized and emerging domains, and existing GraphRAG approaches suffer from high engineering complexity, tight coupling among components, and evaluation contamination caused by overlap with pretraining data. Method: The paper proposes a robust GraphRAG framework with dual-level and logic-form retrieval. It builds a graph-structured knowledge base and combines coarse-grained global retrieval with fine-grained logical reasoning over subgraphs. It also optimizes retrieval for a 32K context and adds a lightweight multi-stage verification mechanism. Contribution/Results: The approach decouples retrieval components so that their individual effects can be measured, addresses data-overlap bias in evaluation, and combines logical formalization with graph-structural modeling. On a domain-specific test set, the score of Qwen2.5-7B-Instruct rises from 60 to 74.5, with gains in fuzzy matching and structured reasoning. The framework is open-sourced for reproducibility.

📝 Abstract

Large Language Models (LLMs) perform well on familiar queries but struggle with specialized or emerging topics. Graph-based Retrieval-Augmented Generation (GraphRAG) addresses this by structuring domain knowledge as a graph for dynamic retrieval. However, existing pipelines involve complex engineering workflows, making it difficult to isolate the impact of individual components. Evaluating retrieval effectiveness is also challenging because datasets overlap with LLM pretraining data. In this work, we introduce HuixiangDou2, a robustly optimized GraphRAG framework. Specifically, we build on dual-level retrieval, optimize it for maximum precision within a 32K context, and compare logic-based retrieval with dual-level retrieval. Our comparative experiments use a test set on which Qwen2.5-7B-Instruct initially underperformed; with our approach, its score improved from 60 to 74.5, as illustrated in the Figure. Experiments on domain-specific datasets reveal that dual-level retrieval enhances fuzzy matching, while logic-form retrieval improves structured reasoning. Furthermore, we propose a multi-stage verification mechanism to improve retrieval robustness without increasing computational cost. Empirical results show significant accuracy gains over baselines, highlighting the importance of adaptive retrieval. To support research and adoption, we release HuixiangDou2 as an open-source resource at https://github.com/tpoisonooo/huixiangdou2.
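To make the idea of dual-level retrieval under a context budget more concrete, here is a minimal sketch. It is not the HuixiangDou2 implementation: the `Edge` type, the `overlap` word-overlap scorer, the `dual_level_retrieve` function, the toy data, and the `budget_words` parameter (a stand-in for a 32K-token context limit) are all hypothetical names chosen for illustration. The sketch assumes that the fine-grained level matches entities, the coarse-grained level matches relation summaries, and the merged snippets are packed greedily until the budget is spent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    summary: str  # theme-level description of the relation (hypothetical field)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct lowercase words shared."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def dual_level_retrieve(query, entities, edges, budget_words=50):
    """Return context snippets from both levels without exceeding the budget."""
    # Fine-grained level: entities whose name or description matches the query.
    fine = sorted(
        (n for n, d in entities.items() if overlap(query, f"{n} {d}") > 0),
        key=lambda n: -overlap(query, f"{n} {entities[n]}"),
    )
    # Coarse-grained level: relations whose summary matches the query.
    coarse = sorted(
        (e for e in edges if overlap(query, e.summary) > 0),
        key=lambda e: -overlap(query, e.summary),
    )
    candidates = [f"{n}: {entities[n]}" for n in fine]
    candidates += [f"{e.src} -> {e.dst}: {e.summary}" for e in coarse]

    # Greedy packing: skip any snippet that would overflow the budget.
    context, used = [], 0
    for snippet in candidates:
        cost = len(snippet.split())  # word count as an assumed token proxy
        if used + cost > budget_words:
            continue
        context.append(snippet)
        used += cost
    return context


# Toy knowledge graph, invented for this example only.
ENTITIES = {
    "rice": "a cereal grain grown in paddies",
    "blast": "a fungal disease of rice",
}
EDGES = [Edge("blast", "rice", "disease reduces crop yield")]

if __name__ == "__main__":
    for line in dual_level_retrieve("rice disease", ENTITIES, EDGES):
        print(line)
```

A real pipeline would presumably replace the word-overlap scorer with keyword extraction or embedding similarity and count tokens with the model's tokenizer; the sketch only shows how two retrieval levels can feed one bounded context.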

Problem

Research questions and friction points this paper is trying to address.

Enhance LLM performance on specialized topics

Simplify complex GraphRAG engineering workflows

Improve retrieval robustness and accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized dual-level retrieval for precision

Multi-stage verification enhances retrieval robustness

Open-source GraphRAG framework for dynamic knowledge
