🤖 AI Summary
Current large language models (LLMs) lack systematic, multidimensional evaluation criteria for generating academic survey papers. To address this gap, we propose SurveyEval, the first comprehensive benchmark specifically designed for evaluating academic survey generation. SurveyEval assesses outputs across three core dimensions: overall survey quality, outline coherence, and reference accuracy, spanning seven disciplinary domains. Methodologically, it integrates retrieval-augmented generation (RAG), long-context evaluation techniques, and an enhanced LLM-as-a-Judge framework augmented with human-annotated references, improving the alignment between automated metrics and human judgment. Experimental results demonstrate that domain-specific survey-generation systems significantly outperform general-purpose long-text and academic-writing models. With strong discriminative power and scalability, SurveyEval offers a reliable, open, and extensible evaluation platform for future research on academic survey generation.
📄 Abstract
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent work focuses on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across seven subjects and augment the LLM-as-a-Judge framework with human references to strengthen the alignment between automated evaluation and human judgment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed for understanding and improving automatic survey systems across diverse subjects and evaluation criteria.
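To make the reference-augmented LLM-as-a-Judge setup concrete, the sketch below scores one generated survey along the three benchmark dimensions, grounding the judge's prompt in a human-written reference survey. This is a minimal illustration under stated assumptions, not the SurveyEval implementation: the `judge_survey` helper, the prompt wording, the 1-5 scale, and the generic `llm` callable are all hypothetical names introduced here.

```python
# Minimal sketch of reference-augmented LLM-as-a-Judge scoring (assumed design,
# not the authors' code). `llm` is any callable mapping a prompt string to a reply.
from typing import Callable

# Dimension names follow the paper; prompt wording and the 1-5 scale are assumptions.
DIMENSIONS = ["overall quality", "outline coherence", "reference accuracy"]

PROMPT_TEMPLATE = """You are grading an automatically generated academic survey.
Dimension: {dimension}

Human-written reference survey (gold standard):
{reference}

Generated survey:
{candidate}

Score the generated survey on this dimension from 1 (poor) to 5 (excellent).
Reply with the number only."""


def judge_survey(candidate: str, reference: str,
                 llm: Callable[[str], str]) -> dict[str, float]:
    """Score a generated survey against a human reference on each dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = PROMPT_TEMPLATE.format(
            dimension=dim, reference=reference, candidate=candidate)
        reply = llm(prompt)
        scores[dim] = float(reply.strip().split()[0])  # parse the leading number
    return scores


if __name__ == "__main__":
    # Dummy judge that always answers "4", standing in for a real LLM call.
    demo = judge_survey("...generated survey text...",
                        "...human reference survey...",
                        llm=lambda prompt: "4")
    print(demo)
```

Anchoring the prompt in a human reference, rather than asking the judge to grade in isolation, is what the abstract means by strengthening evaluation-human alignment: the judge compares against a known-good exemplar instead of its own implicit standards.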