GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering

📅 2024-09-10

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing RAG evaluation methods rely on LLM-as-a-Judge (e.g., GPT-4) but overlook the judge's systematic weaknesses in calibration and discriminative power, so generator failure modes go undetected. Method: We propose GroUSE, the first meta-evaluation benchmark for judge models, comprising 144 unit tests that cover seven RAG generator failure modes. We also fine-tune Llama-3 on GPT-4's reasoning traces to improve its calibration as a judge. Contribution/Results: Experiments show that mainstream open-source judge models generalize poorly to the GroUSE criteria. The fine-tuned Llama-3 agrees more closely with GPT-4 (+28.6% Kendall τ) and is better calibrated. GroUSE pinpoints blind spots in judge models, supporting interpretable and reproducible meta-evaluation of RAG judges.
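
To make the agreement metric concrete, here is a minimal sketch of Kendall's τ between two judges' scores on the same set of answers. It computes the simple tau-a variant (tied pairs count as neither concordant nor discordant); which variant the paper reports is not stated here, so treat that choice, the function name, and the sample scores as illustrative assumptions.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a between two equally long score lists.

    Assumption: tau-a is used for illustration only; tied pairs add
    nothing to the numerator but still count in the denominator.
    """
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both judges order items i and j the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores from a reference judge and a candidate judge.
reference_scores = [1, 2, 3, 4]
candidate_scores = [1, 3, 2, 4]
tau = kendall_tau_a(reference_scores, candidate_scores)
```

A value near 1 means the candidate judge ranks answers almost exactly as the reference judge does. As the abstract notes, this kind of agreement can be high while specific failure modes still go undetected.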

📝 Abstract

Retrieval-Augmented Generation (RAG) has emerged as a common paradigm to use Large Language Models (LLMs) alongside private and up-to-date knowledge bases. In this work, we address the challenges of using LLM-as-a-Judge when evaluating grounded answers generated by RAG systems. To assess the calibration and discrimination capabilities of judge models, we identify 7 generator failure modes and introduce GroUSE (Grounded QA Unitary Scoring of Evaluators), a meta-evaluation benchmark of 144 unit tests. This benchmark reveals that existing automated RAG evaluation frameworks often overlook important failure modes, even when using GPT-4 as a judge. To improve on the current design of automated RAG evaluation frameworks, we propose a novel pipeline and find that while closed models perform well on GroUSE, state-of-the-art open-source judges do not generalize to our proposed criteria, despite strong correlation with GPT-4's judgement. Our findings suggest that correlation with GPT-4 is an incomplete proxy for the practical performance of judge models and should be supplemented with evaluations on unit tests for precise failure mode detection. We further show that finetuning Llama-3 on GPT-4's reasoning traces significantly boosts its evaluation capabilities, improving upon both correlation with GPT-4's evaluations and calibration on reference situations.
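
The idea of unit-testing a judge can be sketched as follows. This is not the GroUSE test format or scoring pipeline, which the abstract does not specify; the `JudgeUnitTest` fields, the `faithfulness` criterion name, the score range, and both toy judges are hypothetical and exist only to show how a pass rate over unit tests separates a discriminating judge from an overly lenient one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical judge interface: (question, references, answer) -> scores per criterion.
Judge = Callable[[str, List[str], str], Dict[str, int]]


@dataclass
class JudgeUnitTest:
    """One reference situation with the score range a sound judge should give."""
    name: str
    question: str
    references: List[str]
    answer: str
    criterion: str      # hypothetical criterion name, e.g. "faithfulness"
    expected_min: int
    expected_max: int


def pass_rate(judge: Judge, tests: List[JudgeUnitTest]) -> float:
    """Fraction of unit tests where the judge's score falls in the expected range."""
    if not tests:
        raise ValueError("at least one unit test is required")
    passed = 0
    for test in tests:
        scores = judge(test.question, test.references, test.answer)
        score = scores.get(test.criterion)
        # A missing criterion counts as a failed test.
        if score is not None and test.expected_min <= score <= test.expected_max:
            passed += 1
    return passed / len(tests)


def substring_judge(question: str, references: List[str], answer: str) -> Dict[str, int]:
    """Toy judge: an answer is 'faithful' only if it appears verbatim in a reference."""
    grounded = any(answer in ref for ref in references)
    return {"faithfulness": 1 if grounded else 0}


def lenient_judge(question: str, references: List[str], answer: str) -> Dict[str, int]:
    """Toy judge that approves everything, so it cannot detect failures."""
    return {"faithfulness": 1}


QUESTION = "Where is the Eiffel Tower?"
REFERENCES = ["The Eiffel Tower is in Paris."]
TESTS = [
    JudgeUnitTest("grounded answer", QUESTION, REFERENCES,
                  "The Eiffel Tower is in Paris.", "faithfulness", 1, 1),
    JudgeUnitTest("unsupported claim", QUESTION, REFERENCES,
                  "The Eiffel Tower is in Rome.", "faithfulness", 0, 0),
]

strict_rate = pass_rate(substring_judge, TESTS)
lenient_rate = pass_rate(lenient_judge, TESTS)
```

In this toy setup the lenient judge passes only the test where approval happens to be correct, which mirrors the abstract's point that a judge needs to be checked on reference situations, not only on overall agreement with GPT-4.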

Problem

Research questions and friction points this paper is trying to address.

RAG Evaluation

Large Language Models

Accuracy and Coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

GroUSE Framework

Enhanced Automated Evaluation

Llama-3 Fine-tuning

🔎 Similar Papers

Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions