🤖 AI Summary
This study addresses the critical shortage of human reviewers in peer review and the limitations of existing automated evaluation methods, which struggle to accurately assess the logical support between claims and evidence in review comments. To overcome this, the work proposes an interpretable automatic evaluation framework that explicitly models the “warrant”—the inferential link connecting claims and evidence—going beyond conventional approaches that merely detect the presence of evidence. Leveraging language models to extract claims and supporting evidence, the method introduces a novel quantitative metric, WarrantScore, to measure the strength of reasoning. Experimental results demonstrate that this approach achieves significantly higher correlation with human judgments than current state-of-the-art methods, thereby enhancing both the accuracy and efficiency of automated support for peer review.
📝 Abstract
The scientific peer-review process faces a shortage of human reviewers due to the rapid growth in the number of submitted papers. The use of language models to reduce the human cost of peer review has been actively explored as a potential solution. One proposed method evaluates the level of substantiation in scientific reviews, i.e., the extent to which claims are grounded in objective facts, in a human-interpretable manner: it extracts the core components of an argument, claims and evidence, and scores a review by the proportion of claims supported by evidence. However, merely detecting whether a claim has supporting evidence is insufficient; it is also necessary to accurately assess how well the evidence logically supports the claim. We propose a new evaluation metric for scientific review comments that assesses the logical inference between claims and evidence. Experimental results show that the proposed metric correlates more strongly with human scores than conventional methods, indicating its potential to better support the efficiency of the peer-review process.
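The contrast between merely detecting evidence and assessing the warrant can be sketched as follows. This is an illustrative assumption, not the paper's actual definition: the `Claim` structure, the token-overlap `entailment_score` stand-in (a real system would use a trained language model), and the averaging in `warrant_score` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    evidence: Optional[str]  # None if the review offers no supporting evidence

def entailment_score(claim: str, evidence: str) -> float:
    """Toy stand-in for an NLI-style model scoring how strongly the
    evidence entails the claim (the 'warrant'). Uses token overlap
    purely for illustration."""
    c, e = set(claim.lower().split()), set(evidence.lower().split())
    return len(c & e) / len(c) if c else 0.0

def substantiation_level(claims: list[Claim]) -> float:
    """Prior approach: fraction of claims with any evidence attached."""
    return sum(c.evidence is not None for c in claims) / len(claims)

def warrant_score(claims: list[Claim]) -> float:
    """Hypothetical WarrantScore: mean inferential strength over
    claim-evidence pairs, counting unevidenced claims as zero support."""
    return sum(entailment_score(c.text, c.evidence) if c.evidence else 0.0
               for c in claims) / len(claims)

claims = [
    Claim("the ablation results are weak",
          "table 3 shows no ablation on the encoder"),
    Claim("the paper is hard to follow", None),
]
print(round(substantiation_level(claims), 2))  # presence-only view
print(round(warrant_score(claims), 2))         # also weighs inference strength
```

Under this sketch, both reviews in which evidence is present but irrelevant and reviews with no evidence at all are penalized, which is the distinction the abstract draws between detecting evidence and assessing logical inference.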