🤖 AI Summary
Existing RAG evaluation benchmarks emphasize retrieval of local text chunks and therefore fail to assess global reasoning over entire document collections, such as corpus-level statistics, extremum identification, and ranking. To address this gap, we introduce GlobalQA, the first benchmark explicitly designed to evaluate global RAG systems, covering four task categories: counting, extremum detection, sorting, and top-k retrieval. We further propose GlobalRAG, a framework built on a multi-tool collaborative architecture: a chunk-level retriever, an LLM-driven noise filter, and a symbolic aggregation module that enables structured extraction and joint semantic-symbolic reasoning. Experiments show that state-of-the-art methods reach only 1.51 F1 on GlobalQA, whereas GlobalRAG attains 6.63 F1, a substantial improvement that enables systematic assessment and modeling of corpus-level understanding and analytical reasoning.
📝 Abstract
Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability -- global RAG -- which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, "What are the top 10 most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1, validating the effectiveness of our method.
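The four global task types share a common pattern: retrieve chunks, extract structured fields, then compute the answer exactly rather than asking the LLM to do the arithmetic. A minimal sketch of that symbolic aggregation idea, assuming records have already been extracted from retrieved chunks (all function and field names here are hypothetical, not the paper's actual API):

```python
# Illustrative sketch of a symbolic aggregation step: the LLM extracts
# structured records (dicts) from retrieved chunks, and the corpus-level
# operations -- counting, extremum queries, sorting, top-k extraction --
# are delegated to deterministic code for exact results.

def aggregate(records, task, key, k=None):
    """Apply an exact symbolic operation over extracted records.

    records: list of dicts, e.g. [{"title": ..., "citations": 120}, ...]
    task:    one of "count", "max", "min", "sort", "topk"
    key:     the field to aggregate on
    """
    # Drop records where extraction failed to produce the needed field.
    rows = [r for r in records if key in r]
    if task == "count":
        return len(rows)
    if task == "max":
        return max(rows, key=lambda r: r[key])
    if task == "min":
        return min(rows, key=lambda r: r[key])
    if task == "sort":
        return sorted(rows, key=lambda r: r[key], reverse=True)
    if task == "topk":
        return sorted(rows, key=lambda r: r[key], reverse=True)[:k]
    raise ValueError(f"unknown task: {task}")

# Toy example mirroring the "top 10 most cited papers" query from the text.
papers = [
    {"title": "A", "citations": 120},
    {"title": "B", "citations": 340},
    {"title": "C", "citations": 95},
]
top2 = aggregate(papers, "topk", "citations", k=2)
print([p["title"] for p in top2])  # -> ['B', 'A']
```

The design point is that once extraction is structured, the aggregation itself is trivially exact, which is why purely generative baselines that aggregate in-context struggle on these tasks.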