🤖 AI Summary
Current large language models (LLMs) lack systematic, multidimensional evaluation criteria for generating academic survey papers. To address this gap, we propose SurveyEval, the first comprehensive benchmark specifically designed for evaluating academic survey generation. SurveyEval assesses outputs across three core dimensions: overall survey quality, outline coherence, and reference accuracy, spanning seven disciplinary domains. Methodologically, it integrates retrieval-augmented generation (RAG), long-context evaluation techniques, and an enhanced LLM-as-a-Judge framework augmented with human-annotated references, improving the alignment between automated metrics and human judgment. Experimental results demonstrate that domain-specific survey-generation systems significantly outperform general-purpose long-text and academic-writing models. With strong discriminative power and scalability, SurveyEval offers a reliable, open, and extensible evaluation platform for future research on academic survey generation.
📄 Abstract
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent work focuses on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across seven subjects and augment the LLM-as-a-Judge framework with human references to strengthen the alignment between automated evaluation and human judgment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed for understanding and improving automatic survey systems across diverse subjects and evaluation criteria.
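To make the reference-augmented LLM-as-a-Judge setup concrete, the sketch below scores one generated survey along the three benchmark dimensions, grounding the judge's prompt in a human-written reference survey. This is a minimal illustration under stated assumptions, not the SurveyEval implementation: the `judge_survey` helper, the prompt wording, the 1-5 scale, and the generic `llm` callable are all hypothetical names introduced here.

```python
# Minimal sketch of reference-augmented LLM-as-a-Judge scoring (assumed design,
# not the authors' code). `llm` is any callable mapping a prompt string to a reply.
from typing import Callable

# Dimension names follow the paper; prompt wording and the 1-5 scale are assumptions.
DIMENSIONS = ["overall quality", "outline coherence", "reference accuracy"]

PROMPT_TEMPLATE = """You are grading an automatically generated academic survey.
Dimension: {dimension}

Human-written reference survey (gold standard):
{reference}

Generated survey:
{candidate}

Score the generated survey on this dimension from 1 (poor) to 5 (excellent).
Reply with the number only."""


def judge_survey(candidate: str, reference: str,
                 llm: Callable[[str], str]) -> dict[str, float]:
    """Score a generated survey against a human reference on each dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = PROMPT_TEMPLATE.format(
            dimension=dim, reference=reference, candidate=candidate)
        reply = llm(prompt)
        scores[dim] = float(reply.strip().split()[0])  # parse the leading number
    return scores


if __name__ == "__main__":
    # Dummy judge that always answers "4", standing in for a real LLM call.
    demo = judge_survey("...generated survey text...",
                        "...human reference survey...",
                        llm=lambda prompt: "4")
    print(demo)
```

Anchoring the prompt in a human reference, rather than asking the judge to grade in isolation, is what the abstract means by strengthening evaluation-human alignment: the judge compares against a known-good exemplar instead of its own implicit standards.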