🤖 AI Summary
Existing open-source RAG evaluation tools are not well suited to long-form, citation-backed report generation. This paper introduces Auto-ARGUE, a robust, open-source, LLM-based implementation of the recent ARGUE framework for evaluating this task. Auto-ARGUE employs large language models as core evaluators alongside human judgments, and is validated on the report generation pilot task from the TREC 2024 NeuCLIR track. It enables fine-grained, interpretable, report-level assessment across multiple dimensions, including citation accuracy, content faithfulness, and structural coherence. A companion open-source web application provides interactive visualization of Auto-ARGUE outputs, enhancing evaluation transparency. On the TREC 2024 Report Generation Pilot Task, Auto-ARGUE achieves strong system-level correlation with human judgments (Spearman ρ > 0.92), demonstrating its reliability. This work fills a gap in the evaluation of RAG-based report generation systems.
📝 Abstract
Generation of long-form, citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, none are tailored to report generation. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recent ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.