ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the superficial, weakly evidenced peer-review comments that current large language models tend to generate. The authors propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that splits reviewing into two stages: a Phi-4-14B-based drafter produces an initial review, and a GPT-OSS-120B-based grounding stage then uses tools to consolidate evidence for each critique, adding depth and traceability. The work also introduces REVIEWBENCH, a benchmark that scores review text against paper-specific rubrics derived from official review guidelines, the paper's content, and human-written reviews. On REVIEWBENCH, the approach consistently outperforms baselines built on substantially stronger or larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B), both in alignment with human judgments and in rubric-based review quality across 8 dimensions.

📝 Abstract

The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.
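
The drafting and grounding split described in the abstract can be pictured with a small Python toy. This is only a sketch under loud assumptions, not the authors' implementation: every name here (`Comment`, `ground_review`, `quote_paper`, `rubric_coverage`) is hypothetical, the drafter is any callable standing in for an LLM, the single "tool" is a keyword lookup over the paper text, and rubric scoring is reduced to keyword coverage rather than a judged score.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins: a drafter maps paper text to draft claims,
# a tool maps (paper, claim) to supporting evidence snippets.
Drafter = Callable[[str], List[str]]
Tool = Callable[[str, str], List[str]]


@dataclass
class Comment:
    text: str
    evidence: List[str] = field(default_factory=list)


def quote_paper(paper: str, claim: str) -> List[str]:
    """Toy tool (assumption): return paper sentences sharing a long word with the claim."""
    words = {w.strip(".,").lower() for w in claim.split()}
    keys = {w for w in words if len(w) > 4}
    sentences = [s.strip() for s in paper.split(".") if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in keys)]


def ground_review(paper: str, drafter: Drafter, tools: List[Tool]) -> List[Comment]:
    """Stage 1: draft claims. Stage 2: attach evidence gathered by each tool."""
    comments = []
    for claim in drafter(paper):
        evidence: List[str] = []
        for tool in tools:
            evidence.extend(tool(paper, claim))
        comments.append(Comment(claim, evidence))
    return comments


def rubric_coverage(comments: List[Comment], rubric: Dict[str, List[str]]) -> float:
    """Toy metric (assumption): share of rubric items matched by an evidence-backed comment."""
    if not rubric:
        return 0.0
    grounded = [c.text.lower() for c in comments if c.evidence]
    hits = sum(
        any(k in t for t in grounded for k in keywords)
        for keywords in rubric.values()
    )
    return hits / len(rubric)
```

With a drafter that returns one claim about baselines and one about writing quality, only the claim that the toy tool can tie to a sentence in the paper counts toward coverage, which mirrors the idea that ungrounded comments should not earn rubric credit.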

Problem

Research questions and friction points this paper is trying to address.

peer review

large language models

substantive feedback

evidence-grounded

review quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-guided review

tool-integrated agents

evidence grounding