🤖 AI Summary
Existing RAG evaluation benchmarks emphasize retrieval of local text chunks and therefore fail to assess global reasoning over entire document collections, such as corpus-level statistics, extremum identification, and ranking. To address this gap, we introduce GlobalQA, the first benchmark explicitly designed to evaluate global RAG systems, covering four task categories: counting, extremum detection, sorting, and top-k retrieval. We further propose GlobalRAG, a framework built on a multi-tool collaborative architecture: a chunk-level retriever, an LLM-driven noise filter, and a symbolic aggregation module that enables structured extraction and joint semantic-symbolic reasoning. Experiments show that state-of-the-art methods reach only 1.51 F1 on GlobalQA, whereas GlobalRAG attains 6.63 F1, a substantial improvement that enables systematic assessment and modeling of corpus-level understanding and analytical reasoning.
📝 Abstract
Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
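The four global task types share a common pattern: retrieve chunks, extract structured fields, then compute the answer exactly rather than asking the LLM to do the arithmetic. A minimal sketch of that symbolic aggregation idea, assuming records have already been extracted from retrieved chunks (all function and field names here are hypothetical, not the paper's actual API):

```python
# Illustrative sketch of a symbolic aggregation step: the LLM extracts
# structured records (dicts) from retrieved chunks, and the corpus-level
# operations -- counting, extremum queries, sorting, top-k extraction --
# are delegated to deterministic code for exact results.

def aggregate(records, task, key, k=None):
    """Apply an exact symbolic operation over extracted records.

    records: list of dicts, e.g. [{"title": ..., "citations": 120}, ...]
    task:    one of "count", "max", "min", "sort", "topk"
    key:     the field to aggregate on
    """
    # Drop records where extraction failed to produce the needed field.
    rows = [r for r in records if key in r]
    if task == "count":
        return len(rows)
    if task == "max":
        return max(rows, key=lambda r: r[key])
    if task == "min":
        return min(rows, key=lambda r: r[key])
    if task == "sort":
        return sorted(rows, key=lambda r: r[key], reverse=True)
    if task == "topk":
        return sorted(rows, key=lambda r: r[key], reverse=True)[:k]
    raise ValueError(f"unknown task: {task}")

# Toy example mirroring the "top 10 most cited papers" query from the text.
papers = [
    {"title": "A", "citations": 120},
    {"title": "B", "citations": 340},
    {"title": "C", "citations": 95},
]
top2 = aggregate(papers, "topk", "citations", k=2)
print([p["title"] for p in top2])  # -> ['B', 'A']
```

The design point is that once extraction is structured, the aggregation itself is trivially exact, which is why purely generative baselines that aggregate in-context struggle on these tasks.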