Structured RAG for Answering Aggregative Questions

📅 2025-11-11

🤖 AI Summary

Existing RAG methods struggle with complex queries that require aggregating information and reasoning over many steps across large document collections. To address this, we propose S-RAG, a framework designed for aggregation-centric question answering. S-RAG constructs a structured representation of the corpus, automatically translates natural-language queries into formal queries over that representation, and combines retrieval-augmented generation with long-context large language models to fuse information from multiple sources. To advance research in this direction, we introduce two new benchmark datasets, HOTELS and WORLD CUP, curated to evaluate aggregation capabilities. Experiments show that S-RAG substantially outperforms both common RAG systems and standalone long-context LLMs on these datasets and on a public benchmark.

📝 Abstract

Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.
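
The two stages described above can be illustrated with a minimal sketch. It assumes SQLite as the structured store and SQL as the formal query language; the abstract confirms neither choice. The `hotels` schema, the sample records, and the hand-written SQL string (standing in for the output of a query-translation step) are all hypothetical.

```python
import sqlite3

# Hypothetical records that an ingestion step might extract from free-text
# hotel descriptions. Schema and values are invented for illustration.
RECORDS = [
    ("Hotel A", "Paris", 4, 180.0),
    ("Hotel B", "Paris", 5, 320.0),
    ("Hotel C", "Rome", 3, 95.0),
    ("Hotel D", "Rome", 4, 150.0),
]


def ingest(records):
    """Ingestion time: load extracted records into a structured store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hotels (name TEXT, city TEXT, stars INTEGER, price REAL)"
    )
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def answer(conn, formal_query):
    """Inference time: run the formal query over the structured store."""
    return conn.execute(formal_query).fetchall()


# Question: "What is the average price of hotels in each city?"
# A real system would have a model produce this query; here it is hand-written.
FORMAL_QUERY = "SELECT city, AVG(price) FROM hotels GROUP BY city ORDER BY city"

conn = ingest(RECORDS)
result = answer(conn, FORMAL_QUERY)
print(result)  # [('Paris', 250.0), ('Rome', 122.5)]
```

Because the aggregate is computed by the query engine over every row, the answer does not depend on how many source documents fit into a model's context.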

Problem

Research questions and friction points this paper is trying to address.

Addresses aggregative queries requiring multi-document information gathering

Proposes structured RAG approach for corpus representation and formal querying

Introduces datasets to validate performance on aggregative question answering
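
A toy example may help show why retrieving only a few passages falls short on aggregative queries. The corpus, the word-overlap scorer, and the cutoff `k` below are all hypothetical stand-ins, not the retrieval setup evaluated in the paper.

```python
# Toy corpus: one short "document" per match, each with a goals figure.
# All data is invented for illustration.
CORPUS = [
    {"text": "final match report goals scored", "goals": 4},
    {"text": "semi final match report goals", "goals": 3},
    {"text": "group stage summary", "goals": 1},
    {"text": "group stage summary", "goals": 0},
    {"text": "quarter final match report", "goals": 2},
]


def score(query, doc):
    """Hypothetical relevance score: number of words shared with the query."""
    return len(set(query.split()) & set(doc["text"].split()))


def top_k(query, corpus, k):
    """Keep only the k highest-scoring documents, as a top-k retriever would."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]


query = "total goals scored across all match reports"
retrieved = top_k(query, CORPUS, k=2)

partial_total = sum(d["goals"] for d in retrieved)  # sees only 2 documents
full_total = sum(d["goals"] for d in CORPUS)        # needs every document

print(partial_total, full_total)  # 7 10
```

The retrieved subset yields 7 while the true total is 10: every document contributes to the answer, including those that share no words with the question.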

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured corpus representation for information aggregation

Natural language to formal query translation mechanism

Specialized datasets for validating aggregative question answering
