🤖 AI Summary
This work investigates the reliability and feasibility of human evaluation for Retrieval-Augmented Generation (RAG). To counter the overreliance on LLM-based scoring and automated metrics in existing RAG evaluation, the authors introduce CrowdRAG-25, a benchmark corpus of 1,806 responses (903 human-written and 903 LLM-generated) for the 301 topics of the TREC RAG'24 track, written in three discourse styles (bulleted list, essay, and news). For a selection of 65 topics, the corpus adds over 57,000 pairwise judgments (47,320 human and 10,556 LLM) across seven utility dimensions. It is the first systematic study to empirically validate high inter-annotator agreement (Cohen’s κ = 0.72–0.89) and cost-effectiveness (human annotation costs only 20% of LLM-based evaluation) for both response generation and multi-dimensional utility assessment. All data and tools are freely available. The study advances RAG evaluation toward a human-centered, reproducible, and multi-dimensional paradigm.
📝 Abstract
How good are humans at writing and judging responses in retrieval-augmented generation (RAG) scenarios? To answer this question, we investigate the efficacy of crowdsourcing for RAG through two complementary studies: response writing and response utility judgment. We present the Crowd RAG Corpus 2025 (CrowdRAG-25), which consists of 903 human-written and 903 LLM-generated responses for the 301 topics of the TREC RAG'24 track, across the three discourse styles 'bulleted list', 'essay', and 'news'. For a selection of 65 topics, the corpus further contains 47,320 pairwise human judgments and 10,556 pairwise LLM judgments across seven utility dimensions (e.g., coverage and coherence). Our analyses give insights into human writing behavior for RAG and the viability of crowdsourcing for RAG evaluation. Human pairwise judgments provide reliable and cost-effective results compared to LLM-based pairwise or human/LLM-based pointwise judgments, as well as automated comparisons with human-written reference responses. All our data and tools are freely available.