🤖 AI Summary
Existing automated paper review methods rely either on shallow manuscript features or directly on large language models (LLMs), which leads to hallucination, scoring bias, and an inability to model the argumentative dynamics of reviewer-author interaction. This paper proposes ReViewGraph, a framework that simulates multi-round reviewer-author debates with LLM-driven agents, explicitly encodes the opinion relations that arise in those debates (e.g., acceptance, rejection, clarification, and compromise) as typed edges in a heterogeneous graph, and applies a graph neural network to reason over that structure. The core idea is to treat review decisions as graph reasoning over the debate, capturing both semantic content and interaction dynamics. Evaluated on three datasets, ReViewGraph achieves an average relative improvement of 15.73% over strong baselines.
📝 Abstract
Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer-author debate structures.
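To make the idea of typed edges in a heterogeneous interaction graph concrete, the following is a minimal sketch and not the authors' implementation. Only the four relation names come from the abstract. The node names, the 2-d features, the scalar per-relation weights, and the helper functions `group_by_relation` and `propagate` are all hypothetical, chosen purely for illustration; a real graph neural network would learn weight matrices per relation rather than use fixed scalars.

```python
from collections import defaultdict

# Relation types named in the abstract; everything below is illustrative.
RELATIONS = ("acceptance", "rejection", "clarification", "compromise")

# Hypothetical debate turns as typed edges: (source, relation, target).
EDGES = [
    ("reviewer_1", "rejection", "author"),
    ("author", "clarification", "reviewer_1"),
    ("reviewer_1", "compromise", "author"),
    ("reviewer_2", "acceptance", "author"),
]

# Hypothetical 2-d node features (stand-ins for statement embeddings).
FEATURES = {
    "reviewer_1": [1.0, 0.0],
    "reviewer_2": [0.0, 1.0],
    "author": [0.5, 0.5],
}

# Hypothetical fixed scalar weight per relation (assumption: a trained
# model would learn a weight matrix for each relation instead).
REL_WEIGHT = {
    "acceptance": 1.0,
    "rejection": -1.0,
    "clarification": 0.5,
    "compromise": 0.25,
}


def group_by_relation(edges):
    """Index incoming neighbours as target -> relation -> [sources]."""
    incoming = defaultdict(lambda: defaultdict(list))
    for src, rel, dst in edges:
        if rel not in RELATIONS:
            raise ValueError(f"unknown relation: {rel}")
        incoming[dst][rel].append(src)
    return incoming


def propagate(features, edges, rel_weight):
    """One relation-aware message-passing step.

    Each node keeps its own feature and adds, for every incoming relation
    type, the relation weight times the mean feature of the source nodes.
    """
    incoming = group_by_relation(edges)
    updated = {}
    for node, feat in features.items():
        out = list(feat)
        for rel, sources in incoming[node].items():
            for i in range(len(out)):
                mean_i = sum(features[s][i] for s in sources) / len(sources)
                out[i] += rel_weight[rel] * mean_i
        updated[node] = out
    return updated


updated = propagate(FEATURES, EDGES, REL_WEIGHT)
```

In this toy graph, `reviewer_2` has no incoming edges, so its feature is unchanged after the step, while the `author` node aggregates separate messages for rejection, compromise, and acceptance. Keeping the aggregation separate per relation is what lets a model treat a rejection differently from a clarification, which is the property the abstract attributes to typed edges.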