🤖 AI Summary
Existing RAG evaluation frameworks predominantly focus on single-source, short-answer scenarios, failing to adequately assess systems’ capabilities in multi-source information integration and coherent long-text generation.
Method: We propose a scalable evaluation framework for Multi-Source Retrieval and Synthesis (MSRS), introducing two new benchmarks, MSRS-Story and MSRS-Meet, to systematically evaluate RAG systems' cross-document information fusion and long-answer coherence. The approach combines sparse and dense retrieval, and employs oracle retrieval experiments to decouple retrieval quality from generative capability.
Contribution/Results: Empirical results demonstrate that answer quality is highly dependent on retrieval effectiveness; reasoning-oriented large language models significantly outperform standard LLMs on multi-source synthesis tasks, even under ideal (oracle) retrieval conditions, confirming the robustness of this advantage. This work establishes novel benchmarks and provides critical insights for advancing RAG evaluation paradigms.
📝 Abstract
Retrieval-augmented systems are typically evaluated in settings where information required to answer the query can be found within a single source or the answer is short-form or factoid-based. However, many real-world applications demand the ability to integrate and summarize information scattered across multiple sources, where no single source is sufficient to respond to the user's question. In such settings, the retrieval component of a RAG pipeline must recognize a variety of relevance signals, and the generation component must connect and synthesize information across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources and generate long-form responses. Using our framework, we build two new benchmarks on Multi-Source Retrieval and Synthesis: MSRS-Story and MSRS-Meet, representing narrative synthesis and summarization tasks, respectively, that require retrieval from large collections. Our extensive experiments with various RAG pipelines -- including sparse and dense retrievers combined with frontier LLMs -- reveal that generation quality is highly dependent on retrieval effectiveness, which varies greatly by task. While multi-source synthesis proves challenging even in an oracle retrieval setting, we find that reasoning models significantly outperform standard LLMs at this distinct step.
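The oracle-retrieval ablation described above can be sketched as a small evaluation harness in which the retriever is a swappable component: under the oracle condition the gold source documents are handed directly to the generator, isolating synthesis quality from retrieval errors. This is a minimal illustrative sketch, not the authors' code; all function names (`sparse_retrieve`, `oracle_retrieve`, `evaluate`) and the toy term-overlap retriever are assumptions standing in for real BM25/dense retrievers and an LLM generator.

```python
# Hypothetical sketch of the ablation design: run the same generation step
# under different retrieval conditions (here, a toy sparse retriever vs. an
# oracle that returns the gold source documents). Names are illustrative only.
from typing import Callable, Dict, List


def sparse_retrieve(query: str, corpus: Dict[str, str], k: int) -> List[str]:
    """Toy stand-in for a sparse retriever: rank docs by term overlap."""
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(corpus[d].lower().split())))
    return ranked[:k]


def oracle_retrieve(gold_doc_ids: List[str], k: int) -> List[str]:
    """Oracle condition: return the known gold sources directly, so any
    remaining quality gap is attributable to the generation step."""
    return gold_doc_ids[:k]


def evaluate(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    gold: Dict[str, List[str]],
    generate: Callable[[str, List[str]], str],
    retriever: str = "sparse",
    k: int = 5,
) -> Dict[str, str]:
    """Run one retrieval condition end to end and collect long-form answers."""
    answers = {}
    for qid, query in queries.items():
        if retriever == "oracle":
            doc_ids = oracle_retrieve(gold[qid], k)
        else:
            doc_ids = sparse_retrieve(query, corpus, k)
        context = [corpus[d] for d in doc_ids]
        answers[qid] = generate(query, context)
    return answers
```

Comparing scores between the `"sparse"` (or a dense) condition and the `"oracle"` condition separates how much of the final answer quality is lost to retrieval versus to multi-source synthesis itself.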