🤖 AI Summary
Declining review quality at AI conferences necessitates identifying misinformed review points: either "weaknesses" that rest on incorrect premises or "questions" that the paper already answers.
Method: We propose the first fine-grained, premise-level factual evaluation framework that formally defines and quantifies misinformed review points, yielding the ReviewScore metric. Using large language models (LLMs), we automatically reconstruct both the explicit and implicit premises behind each weakness, build a human expert-annotated dataset, and run factual-judgment and human–LLM agreement analyses across eight state-of-the-art LLMs.
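The premise-level framing translates directly into a small data model: a weakness is misinformed if any reconstructed premise is factually wrong, and a question is misinformed if the paper already answers it. A minimal Python sketch of that aggregation logic; the class and field names (`Premise`, `ReviewPoint`, `is_misinformed`) are hypothetical and not taken from the paper's code:

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Premise:
    text: str
    is_factual: bool  # judged against the paper by a human expert or an LLM

@dataclass
class ReviewPoint:
    kind: Literal["weakness", "question"]
    text: str
    premises: List[Premise] = field(default_factory=list)  # explicit + implicit premises
    answered_by_paper: bool = False                         # only meaningful for questions

def is_misinformed(point: ReviewPoint) -> bool:
    """A weakness is misinformed if any of its premises is factually wrong;
    a question is misinformed if the paper already answers it."""
    if point.kind == "weakness":
        return any(not p.is_factual for p in point.premises)
    return point.answered_by_paper

# Example: a weakness resting on an incorrect premise is flagged as misinformed.
weakness = ReviewPoint(
    kind="weakness",
    text="The paper does not report any ablation study.",
    premises=[Premise("The paper reports no ablation study.", is_factual=False)],
)
print(is_misinformed(weakness))  # True
```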
Contribution/Results: We find that 15.2% of weaknesses and 26.4% of questions are misinformed. LLMs reach moderate agreement with human experts at the premise level (Cohen's κ = 0.42–0.58), significantly higher than when factuality is judged at the whole-weakness level. This supports the feasibility of automated, interpretable assessment of review quality.
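The reported agreement corresponds to standard inter-rater statistics over binary premise-level factuality labels. A minimal sketch using scikit-learn's `cohen_kappa_score`; the label arrays below are invented placeholders, not data from the paper:

```python
from sklearn.metrics import cohen_kappa_score

# Binary factuality labels per premise: 1 = factually correct, 0 = incorrect.
# Both arrays are illustrative; real labels would come from the
# expert-annotated ReviewScore dataset and from each LLM's judgments.
human_labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
llm_labels   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```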
📝 Abstract
Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
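As a rough illustration of the premise-reconstruction step the abstract describes, one could prompt an LLM to enumerate the explicit and implicit premises behind a weakness. A sketch using the OpenAI Python client; the model name, prompt wording, and output parsing are assumptions for illustration, not the paper's actual engine:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reconstruct_premises(weakness: str, paper_excerpt: str) -> list[str]:
    """Ask an LLM to list the explicit and implicit premises a weakness relies on.
    Prompt and model choice are illustrative, not the paper's actual setup."""
    prompt = (
        "Below is a weakness from a peer review and an excerpt of the paper.\n"
        "List every explicit and implicit premise the weakness relies on, "
        "one per line.\n\n"
        f"Weakness: {weakness}\n\nPaper excerpt: {paper_excerpt}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper evaluates eight different LLMs
        messages=[{"role": "user", "content": prompt}],
    )
    # Each returned line is treated as one reconstructed premise.
    return [line.strip("- ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]
```

Each reconstructed premise could then be judged for factuality against the paper and fed into the misinformed-point check sketched earlier.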