🤖 AI Summary
Reliable automatic evaluation metrics for scientific text revision are lacking: similarity-based metrics such as ROUGE and BERTScore fail to capture *meaningful* improvements. This paper systematically exposes those limitations and proposes the first hybrid evaluation framework integrating LLM-as-a-judge assessment with domain-customized reference-free metrics, supporting both reference-based and reference-free settings, and jointly modeling *relevance* and *factual correctness*. Human annotation experiments reveal that while LLMs excel at assessing instruction adherence, they are weak at factual verification, which makes complementary domain-specific signals necessary. Evaluated across multiple scientific revision datasets, the framework achieves the highest agreement with human judgments (Spearman ρ > 0.82), significantly outperforming existing methods, and establishes a robust, dual-dimensional evaluation paradigm tailored to the nuanced requirements of scientific text revision.
📝 Abstract
Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily measure surface similarity rather than meaningful improvement. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation with task-specific metrics offers the most reliable assessment of revision quality.
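The hybrid idea described in the abstract can be illustrated with a minimal sketch: combine an LLM-judge score (instruction adherence) with a task-specific metric score (e.g., a factual-correctness signal), then measure agreement with human ratings via Spearman's ρ. All scores, weights, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman_rho(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def hybrid_score(llm_judge, task_metric, alpha=0.5):
    """Illustrative linear combination of the two evaluation signals."""
    return alpha * llm_judge + (1 - alpha) * task_metric

# Toy example: five candidate revisions scored by each signal.
llm = [0.9, 0.7, 0.4, 0.8, 0.2]   # LLM-judge: instruction adherence
fact = [0.6, 0.9, 0.3, 0.7, 0.1]  # task-specific: factual correctness
human = [4, 5, 2, 4, 1]           # human quality ratings

hybrid = [hybrid_score(a, b) for a, b in zip(llm, fact)]
print(round(spearman_rho(hybrid, human), 3))  # agreement with humans
```

In practice one would tune the combination weight (and possibly use a learned aggregator) on annotated data; the linear mix here only shows how the two complementary signals enter a single agreement measure.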