🤖 AI Summary
Reliable automatic evaluation metrics for scientific text revision are lacking: similarity-based metrics such as ROUGE and BERTScore fail to capture *meaningful* improvements. This paper systematically exposes those limitations and proposes the first hybrid evaluation framework integrating LLM-as-a-judge assessment with domain-customized reference-free metrics, supporting both reference-based and reference-free settings, and jointly modeling *relevance* and *factual correctness*. Human annotation experiments reveal that while LLMs excel at assessing instruction adherence, they are weak at factual verification, which makes complementary domain-specific signals necessary. Evaluated across multiple scientific revision datasets, the framework achieves the highest agreement with human judgments (Spearman ρ > 0.82), significantly outperforming existing methods, and establishes a robust, dual-dimensional evaluation paradigm tailored to the nuanced requirements of scientific text revision.
📝 Abstract
Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily measure surface similarity rather than meaningful improvement. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation with task-specific metrics offers the most reliable assessment of revision quality.
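The hybrid idea described in the abstract can be illustrated with a minimal sketch: combine an LLM-judge score (instruction adherence) with a task-specific metric score (e.g., a factual-correctness signal), then measure agreement with human ratings via Spearman's ρ. All scores, weights, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman_rho(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def hybrid_score(llm_judge, task_metric, alpha=0.5):
    """Illustrative linear combination of the two evaluation signals."""
    return alpha * llm_judge + (1 - alpha) * task_metric

# Toy example: five candidate revisions scored by each signal.
llm = [0.9, 0.7, 0.4, 0.8, 0.2]   # LLM-judge: instruction adherence
fact = [0.6, 0.9, 0.3, 0.7, 0.1]  # task-specific: factual correctness
human = [4, 5, 2, 4, 1]           # human quality ratings

hybrid = [hybrid_score(a, b) for a, b in zip(llm, fact)]
print(round(spearman_rho(hybrid, human), 3))  # agreement with humans
```

In practice one would tune the combination weight (and possibly use a learned aggregator) on annotated data; the linear mix here only shows how the two complementary signals enter a single agreement measure.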