🤖 AI Summary
Declining review quality at AI conferences necessitates identifying misinformed review points: either "weaknesses" that rest on incorrect premises or "questions" that the paper already answers.
Method: We propose the first fine-grained, premise-level factual evaluation framework that formally defines and quantifies misinformed review points, yielding the ReviewScore metric. Using large language models (LLMs), we automatically reconstruct both the explicit and implicit premises behind each weakness, build a human expert-annotated dataset, and run factual-judgment and human–LLM agreement analyses across eight state-of-the-art LLMs.
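The premise-level framing translates directly into a small data model: a weakness is misinformed if any reconstructed premise is factually wrong, and a question is misinformed if the paper already answers it. A minimal Python sketch of that aggregation logic; the class and field names (`Premise`, `ReviewPoint`, `is_misinformed`) are hypothetical and not taken from the paper's code:

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Premise:
    text: str
    is_factual: bool  # judged against the paper by a human expert or an LLM

@dataclass
class ReviewPoint:
    kind: Literal["weakness", "question"]
    text: str
    premises: List[Premise] = field(default_factory=list)  # explicit + implicit premises
    answered_by_paper: bool = False                         # only meaningful for questions

def is_misinformed(point: ReviewPoint) -> bool:
    """A weakness is misinformed if any of its premises is factually wrong;
    a question is misinformed if the paper already answers it."""
    if point.kind == "weakness":
        return any(not p.is_factual for p in point.premises)
    return point.answered_by_paper

# Example: a weakness resting on an incorrect premise is flagged as misinformed.
weakness = ReviewPoint(
    kind="weakness",
    text="The paper does not report any ablation study.",
    premises=[Premise("The paper reports no ablation study.", is_factual=False)],
)
print(is_misinformed(weakness))  # True
```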
Contribution/Results: We find that 15.2% of weaknesses and 26.4% of questions are misinformed. LLMs reach moderate agreement with human experts at the premise level (Cohen's κ = 0.42–0.58), significantly higher than when factuality is judged at the whole-weakness level. This supports the feasibility of automated, interpretable assessment of review quality.
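The reported agreement corresponds to standard inter-rater statistics over binary premise-level factuality labels. A minimal sketch using scikit-learn's `cohen_kappa_score`; the label arrays below are invented placeholders, not data from the paper:

```python
from sklearn.metrics import cohen_kappa_score

# Binary factuality labels per premise: 1 = factually correct, 0 = incorrect.
# Both arrays are illustrative; real labels would come from the
# expert-annotated ReviewScore dataset and from each LLM's judgments.
human_labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
llm_labels   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```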
📝 Abstract
Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
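As a rough illustration of the premise-reconstruction step the abstract describes, one could prompt an LLM to enumerate the explicit and implicit premises behind a weakness. A sketch using the OpenAI Python client; the model name, prompt wording, and output parsing are assumptions for illustration, not the paper's actual engine:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reconstruct_premises(weakness: str, paper_excerpt: str) -> list[str]:
    """Ask an LLM to list the explicit and implicit premises a weakness relies on.
    Prompt and model choice are illustrative, not the paper's actual setup."""
    prompt = (
        "Below is a weakness from a peer review and an excerpt of the paper.\n"
        "List every explicit and implicit premise the weakness relies on, "
        "one per line.\n\n"
        f"Weakness: {weakness}\n\nPaper excerpt: {paper_excerpt}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper evaluates eight different LLMs
        messages=[{"role": "user", "content": prompt}],
    )
    # Each returned line is treated as one reconstructed premise.
    return [line.strip("- ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]
```

Each reconstructed premise could then be judged for factuality against the paper and fed into the misinformed-point check sketched earlier.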