🤖 AI Summary
Existing open-source RAG evaluation tools are not well suited to long-form, citation-backed report generation. This paper introduces Auto-ARGUE, a robust, open-source, LLM-based implementation of the recent ARGUE framework for evaluating this task. Auto-ARGUE employs large language models as core evaluators alongside human judgments, and is validated on the report generation pilot task from the TREC 2024 NeuCLIR track. It enables fine-grained, interpretable, report-level assessment across multiple dimensions, including citation accuracy, content faithfulness, and structural coherence. A companion open-source web application provides interactive visualization of Auto-ARGUE outputs, enhancing evaluation transparency. On the TREC 2024 Report Generation Pilot Task, Auto-ARGUE achieves strong system-level correlation with human judgments (Spearman ρ > 0.92), demonstrating its reliability. This work fills a gap in the evaluation of RAG-based report generation systems.
📝 Abstract
Generation of long-form, citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, none are tailored to report generation. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recent ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.