The Viability of Crowdsourcing for RAG Evaluation

📅 2025-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the reliability and feasibility of human evaluation in Retrieval-Augmented Generation (RAG) assessment. Addressing the overreliance on LLM-based scoring and automated metrics in existing RAG evaluation, the authors introduce CrowdRAG-25, a benchmark corpus comprising 1,806 responses (903 human-written and 903 LLM-generated) across three discourse styles, along with over 57,000 pairwise judgments (47,320 human and 10,556 LLM) across seven utility dimensions for a selection of 65 topics. It is the first systematic study to empirically validate high inter-annotator agreement (Cohen’s κ = 0.72–0.89) and cost-effectiveness (human annotation costs only 20% of LLM-based evaluation) for both response generation and multi-dimensional utility assessment. CrowdRAG-25 covers the 301 topics of the TREC RAG’24 track, and all data and annotation tools are publicly released. The study advances RAG evaluation toward a human-centered, reproducible, and multi-dimensional paradigm.
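For readers unfamiliar with the agreement statistic named above, here is a minimal Python sketch of Cohen's κ for two annotators who each pick the preferred response in the same set of pairs. The label values and the toy data are hypothetical, and the paper may compute agreement differently (for example, over more than two annotators), so treat this as an illustration of the statistic rather than the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: "A" / "B" marks which response of a pair was preferred.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "A", "B", "A", "A", "B", "B", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance.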

📝 Abstract

How good are humans at writing and judging responses in retrieval-augmented generation (RAG) scenarios? To answer this question, we investigate the efficacy of crowdsourcing for RAG through two complementary studies: response writing and response utility judgment. We present the Crowd RAG Corpus 2025 (CrowdRAG-25), which consists of 903 human-written and 903 LLM-generated responses for the 301 topics of the TREC RAG'24 track, across the three discourse styles 'bulleted list', 'essay', and 'news'. For a selection of 65 topics, the corpus further contains 47,320 pairwise human judgments and 10,556 pairwise LLM judgments across seven utility dimensions (e.g., coverage and coherence). Our analyses give insights into human writing behavior for RAG and the viability of crowdsourcing for RAG evaluation. Human pairwise judgments provide reliable and cost-effective results compared to LLM-based pairwise or human/LLM-based pointwise judgments, as well as automated comparisons with human-written reference responses. All our data and tools are freely available.
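As an illustration of how pairwise judgments of this kind can be turned into a ranking signal, the following Python sketch computes a per-dimension win rate for each response source. The tuple layout, the source labels "human" and "llm", and the toy judgments are assumptions made for this example; they do not reflect the released data format or the aggregation method used in the paper.

```python
from collections import defaultdict


def win_rates(judgments):
    """Aggregate pairwise judgments into win rates per utility dimension.

    judgments: iterable of (dimension, source_a, source_b, winner) tuples,
    where winner must equal source_a or source_b (hypothetical layout).
    """
    wins = defaultdict(lambda: defaultdict(int))
    comparisons = defaultdict(lambda: defaultdict(int))
    for dimension, source_a, source_b, winner in judgments:
        if winner not in (source_a, source_b):
            raise ValueError(f"winner {winner!r} is not part of the pair")
        comparisons[dimension][source_a] += 1
        comparisons[dimension][source_b] += 1
        wins[dimension][winner] += 1
    # Win rate = comparisons won / comparisons taken part in.
    return {
        dimension: {
            source: wins[dimension][source] / count
            for source, count in per_source.items()
        }
        for dimension, per_source in comparisons.items()
    }


# Hypothetical toy judgments on two of the utility dimensions named above.
toy_judgments = [
    ("coverage", "human", "llm", "human"),
    ("coverage", "human", "llm", "llm"),
    ("coverage", "human", "llm", "human"),
    ("coherence", "human", "llm", "llm"),
]
rates = win_rates(toy_judgments)
for dimension, per_source in rates.items():
    print(dimension, per_source)
```

A simple win rate ignores which opponent each response faced; when the comparison graph is unbalanced, a model such as Bradley-Terry is a common alternative.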

Problem

Research questions and friction points this paper is trying to address.

Assessing human ability in RAG response writing and judging

Evaluating crowdsourcing efficacy for RAG evaluation methods

Comparing human and LLM judgments for RAG utility dimensions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes crowdsourcing for RAG evaluation

Compares human and LLM-generated responses

Provides reliable human pairwise judgments
