Expert Preference-based Evaluation of Automated Related Work Generation

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic evaluation methods for generated “Related Work” sections—such as traditional metrics and LLM-as-a-judge approaches—fail to capture domain-expert preferences and fine-grained quality criteria. Method: We propose GREP, a multi-round evaluation framework that explicitly incorporates domain-expert preferences via a decomposed, multi-dimensional comparative assessment (e.g., coverage, accuracy, logical coherence) and supports human-AI collaborative feedback. Technically, GREP implements both closed- and open-source LLM-based systems using contrastive few-shot prompting and quantifiable scoring outputs. Contribution/Results: Experiments demonstrate GREP’s robustness and high agreement with human expert judgments (Spearman’s ρ > 0.85), significantly outperforming baseline methods. Moreover, GREP precisely identifies structural deficiencies—e.g., factual inconsistency, conceptual fragmentation—in state-of-the-art models’ academic survey generation, thereby enabling targeted post-training optimization.

📝 Abstract
Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in LLMs show promising potential in reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional automatic metrics and LLM-as-a-judge systems are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single score, our framework decomposes the evaluation into fine-grained dimensions. This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the human expert assessment. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the quality of AI-generated scientific writing lacks expert-aligned criteria
Current metrics fail to capture domain-specific quality standards and expert preferences
Automated related work generation needs robust, expert-aligned evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn evaluation framework with expert preferences
Fine-grained dimensions for localized evaluation
Contrastive few-shot examples for contextual guidance
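The three innovations above can be illustrated with a minimal sketch: score each fine-grained dimension separately using a contrastive few-shot prompt, then aggregate the per-dimension scores into one cardinal score. This is not the paper's implementation; the dimension set, the weights, and names like `build_prompt` and `judge` are assumptions for illustration, and `judge` stands in for an actual LLM call.

```python
from dataclasses import dataclass
from typing import Callable

# Dimensions follow those named in the summary (coverage, accuracy,
# logical coherence); the exact set and the weights are assumptions.
DIMENSIONS = {"coverage": 0.4, "accuracy": 0.35, "coherence": 0.25}

@dataclass
class ContrastivePair:
    good: str  # excerpt illustrating a high score on this dimension
    bad: str   # excerpt illustrating a low score

def build_prompt(dim: str, pair: ContrastivePair, section: str) -> str:
    """Assemble a contrastive few-shot prompt for one dimension."""
    return (
        f"Rate the related-work section on '{dim}' from 1 to 5.\n"
        f"High-scoring example:\n{pair.good}\n"
        f"Low-scoring example:\n{pair.bad}\n"
        f"Section to rate:\n{section}\nScore:"
    )

def evaluate(section: str,
             examples: dict[str, ContrastivePair],
             judge: Callable[[str], int]) -> dict:
    """Score each dimension separately, then aggregate to a cardinal score.

    `judge` is a placeholder for an LLM call returning an integer 1-5;
    a weighted sum turns the localized scores into one cardinal value.
    """
    per_dim = {
        dim: judge(build_prompt(dim, examples[dim], section))
        for dim in DIMENSIONS
    }
    overall = sum(DIMENSIONS[d] * s for d, s in per_dim.items())
    return {"per_dimension": per_dim, "overall": round(overall, 2)}
```

Swapping the `judge` callable between a proprietary-LLM backend and an open-weight one would mirror the two GREP variants described in the abstract.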