REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning

๐Ÿ“… 2025-05-16
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing AI peer-review systems often produce superficial, overly laudatory feedback lacking depth and critical analysis. Method: REMOR combines multi-objective reinforcement learning with a reasoning-augmented large language model (DeepSeek-R1-Distill-Qwen-7B), trained against HPRR, a human-aligned multi-dimensional reward function, and introduces PeerRT, a high-quality peer-review dataset enriched with explicit reasoning traces. Contribution/Results: Contrary to expectations, the uniformly weighted variant REMOR-U outperforms the human-aligned REMOR-H in substantive feedback quality, achieving notable gains in depth, criticality, and information density over baselines. Experiments show REMOR-U and REMOR-H attain more than twice the average reward of human reviewers, with the best reviews comparable in quality to top-tier human reviews while avoiding the long tail of low-quality human reviews. All code, the HPRR reward function, the PeerRT dataset, and the trained models are publicly released.
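To make the two reward schemes concrete, here is a minimal sketch of how a multi-aspect review score could be collapsed into the scalar rewards driving REMOR-U (uniform weights) and REMOR-H (human-aligned weights). The aspect names, weights, and scores below are illustrative placeholders, not the released HPRR definition.

```python
# Sketch of the two reward schemes compared in the paper.
# Aspect names and all numeric values are illustrative assumptions;
# the released HPRR implementation defines the actual aspects and weights.

ASPECTS = ["criticism", "novelty_assessment", "relevance", "clarity", "depth"]

def aggregate_reward(aspect_scores: dict[str, float],
                     weights: dict[str, float]) -> float:
    """Combine per-aspect scores into a single scalar reward."""
    return sum(weights[a] * aspect_scores[a] for a in ASPECTS)

# REMOR-U: uniform weights over all aspects.
uniform_weights = {a: 1.0 / len(ASPECTS) for a in ASPECTS}

# REMOR-H: weights aligned with human evaluation (placeholder values).
human_aligned_weights = {"criticism": 0.10, "novelty_assessment": 0.15,
                         "relevance": 0.40, "clarity": 0.20, "depth": 0.15}

scores = {"criticism": 0.8, "novelty_assessment": 0.6, "relevance": 0.9,
          "clarity": 0.7, "depth": 0.5}
print(aggregate_reward(scores, uniform_weights))        # REMOR-U reward
print(aggregate_reward(scores, human_aligned_weights))  # REMOR-H reward
```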

๐Ÿ“ Abstract
AI-based peer review systems tend to produce shallow and overpraising suggestions compared to human feedback. Here, we evaluate how well a reasoning LLM trained with multi-objective reinforcement learning (REMOR) can overcome these limitations. We start by designing a multi-aspect reward function that aligns with human evaluation of reviews. The aspects are related to the review itself (e.g., criticisms, novelty) and the relationship between the review and the manuscript (i.e., relevance). First, we perform supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B using LoRA on PeerRT, a new dataset of high-quality top AI conference reviews enriched with reasoning traces. We then apply Group Relative Policy Optimization (GRPO) to train two models: REMOR-H (with the human-aligned reward) and REMOR-U (with a uniform reward). Interestingly, the human-aligned reward penalizes aspects typically associated with strong reviews, leading REMOR-U to produce qualitatively more substantive feedback. Our results show that REMOR-U and REMOR-H achieve more than twice the average rewards of human reviews, non-reasoning state-of-the-art agentic multi-modal AI review systems, and general commercial LLM baselines. We found that while the best AI and human reviews are comparable in quality, REMOR avoids the long tail of low-quality human reviews. We discuss how reasoning is key to achieving these improvements and release the Human-aligned Peer Review Reward (HPRR) function, the Peer Review Reasoning-enriched Traces (PeerRT) dataset, and the REMOR models, which we believe can help spur progress in the area.
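The training recipe in the abstract (LoRA supervised fine-tuning, then GRPO) hinges on GRPO's group-relative advantage: several reviews are sampled per manuscript, scored by the reward function, and each review is reinforced in proportion to how much it beats its own group's average. Below is a minimal PyTorch sketch of that advantage computation with illustrative numbers; the full GRPO objective also applies a clipped policy ratio and a KL penalty to a reference model.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used in GRPO.

    rewards: shape (num_prompts, group_size); each row holds the scalar
    rewards of a group of completions sampled for the same manuscript.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one manuscript, a group of 4 sampled reviews scored by the reward.
rewards = torch.tensor([[0.42, 0.75, 0.31, 0.88]])
print(grpo_advantages(rewards))  # reviews above the group mean get positive advantage
```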
Problem

Research questions and friction points this paper is trying to address.

Improving AI-generated peer reviews to avoid shallow feedback
Aligning review quality with human standards via multi-objective rewards
Enhancing review relevance and novelty using reasoning LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-objective reinforcement learning for review generation
Human-aligned reward function for quality feedback
Reasoning-enriched dataset (PeerRT) for fine-tuning LLMs (a minimal fine-tuning sketch follows this list)
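As a rough illustration of the first training stage, the sketch below sets up LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-7B with the Hugging Face transformers and peft libraries. The rank, alpha, dropout, and target modules are illustrative assumptions, not the paper's reported hyperparameters.

```python
# Minimal LoRA setup sketch, assuming the transformers + peft stack;
# hyperparameters here are placeholders, not the paper's values.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# PeerRT training pairs would be (manuscript, reasoning trace + review),
# formatted as chat turns and passed to a standard causal-LM trainer.
```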
๐Ÿ”Ž Similar Papers
No similar papers found.
Pawin Taechoyotin
PhD student, University of Colorado Boulder · Computer Science, Science of Science
Daniel Acuna
Department of Computer Science, University of Colorado Boulder; Department of Information Science, University of Colorado Boulder; ReviewerZero AI Inc., Boulder, CO