🤖 AI Summary
Existing GraphRAG evaluations rely heavily on generic QA benchmarks, failing to capture improvements in advanced capabilities such as multi-hop reasoning. Method: We introduce the first large-scale, discipline-specific benchmark for GraphRAG—covering 16 academic disciplines and five task formats—to enable end-to-end evaluation of graph construction, retrieval, and generation. It features novel, challenging discipline-oriented questions (e.g., mathematical and programming reasoning), multi-granularity task design, and an interpretability-aware framework for assessing logical reasoning processes. Grounded in knowledge graphs built from 20 core textbooks, it integrates multi-hop reasoning modeling, structured retrieval evaluation, and generation consistency analysis. Contribution/Results: Evaluated across nine state-of-the-art GraphRAG methods, our benchmark quantitatively reveals previously unobserved relationships among graph structural design, retrieval accuracy, and reasoning capability. It provides a reproducible diagnostic toolkit and actionable optimization pathways for GraphRAG development.
📝 Abstract
Graph Retrieval-Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models rely predominantly on traditional question-answering datasets, whose limited question scope and evaluation metrics cannot comprehensively assess the reasoning improvements that GraphRAG enables. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key strengths: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving; for example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks across five formats (multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank), spanning 16 disciplines drawn from 20 core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation; beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.
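The three pipeline stages the benchmark evaluates (graph construction, knowledge retrieval, answer generation) can be illustrated with a minimal toy sketch. Everything below is a hedged, illustrative assumption: the triples, function names, and two-hop limit are invented for clarity and do not come from GraphRAG-Bench or any of the evaluated methods.

```python
from collections import defaultdict, deque

# Toy triples standing in for a textbook-derived knowledge graph
# (illustrative only; not from the benchmark corpus).
TRIPLES = [
    ("gradient descent", "optimizes", "loss function"),
    ("loss function", "measures", "prediction error"),
    ("backpropagation", "computes", "gradients"),
    ("gradients", "drive", "gradient descent"),
]

def build_graph(triples):
    """Stage 1 (graph construction): index triples by subject."""
    graph = defaultdict(list)
    for s, r, o in triples:
        graph[s].append((r, o))
    return graph

def retrieve(graph, seed, max_hops=2):
    """Stage 2 (retrieval): BFS-collect facts within max_hops of the seed,
    mimicking multi-hop structured retrieval."""
    facts, frontier, seen = [], deque([(seed, 0)]), {seed}
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for rel, obj in graph.get(node, []):
            facts.append((node, rel, obj))
            if obj not in seen:
                seen.add(obj)
                frontier.append((obj, depth + 1))
    return facts

def make_prompt(question, facts):
    """Stage 3 (generation): pack retrieved facts into an LLM prompt."""
    context = "\n".join(f"- {s} {r} {o}" for s, r, o in facts)
    return f"Context:\n{context}\n\nQuestion: {question}"

graph = build_graph(TRIPLES)
facts = retrieve(graph, "gradient descent", max_hops=2)
prompt = make_prompt("What does gradient descent ultimately reduce?", facts)
```

A benchmark in this style can then score each stage separately: graph quality at `build_graph`, retrieval accuracy at `retrieve` (did the multi-hop facts needed for the answer get collected?), and both answer correctness and reasoning coherence at the generation step.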