🤖 AI Summary
Current LLM-based peer review methods struggle to simultaneously achieve depth, efficiency, and interpretability. To address this, the authors propose TreeReview, a dynamic hierarchical question-answering framework that models review as recursively constructing, then solving bottom-up, a tree of review questions: hierarchical question decomposition, LLM-driven on-demand question generation, bidirectional tree-structured reasoning, and answer aggregation jointly enable controllable depth and full process transparency. They introduce a benchmark for review-oriented evaluation derived from ICLR and NeurIPS, and design a dynamic question expansion mechanism that reduces token consumption by up to 80% without compromising review quality. Extensive experiments under both human and LLM evaluation demonstrate that the method significantly outperforms strong baselines: it achieves higher expert agreement, yields more comprehensive and insightful review comments, and exhibits superior robustness and fidelity across diverse paper domains.
📝 Abstract
While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions, and then resolves the question tree by iteratively aggregating answers from leaf to root to produce the final review. Crucially, we incorporate a dynamic question expansion mechanism that enables deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comment generation. Experimental results from both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.
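The abstract's two-phase process (top-down question decomposition, then bottom-up answer aggregation with on-demand follow-up questions) can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the node structure, function names, and the deterministic toy stand-ins for the LLM calls (`toy_decompose`, `toy_answer`, `toy_aggregate`, `toy_expand`) are all assumptions made for clarity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class QuestionNode:
    """One node of the review-question tree."""
    question: str
    children: List["QuestionNode"] = field(default_factory=list)
    answer: str = ""

def build_tree(question: str, decompose: Callable[[str], List[str]],
               depth: int = 0, max_depth: int = 2) -> QuestionNode:
    """Top-down phase: recursively decompose a high-level question
    into fine-grained sub-questions, up to a depth limit."""
    node = QuestionNode(question)
    if depth < max_depth:
        for sub in decompose(question):
            node.children.append(build_tree(sub, decompose, depth + 1, max_depth))
    return node

def resolve(node: QuestionNode,
            answer_fn: Callable[[str], str],
            aggregate_fn: Callable[[str, List[str]], str],
            expand_fn: Optional[Callable[[str, List[str]], List[str]]] = None) -> str:
    """Bottom-up phase: answer leaves directly; at internal nodes,
    aggregate child answers, optionally expanding with follow-up
    questions first (the dynamic question expansion mechanism)."""
    if not node.children:
        node.answer = answer_fn(node.question)
    else:
        child_answers = [resolve(c, answer_fn, aggregate_fn, expand_fn)
                         for c in node.children]
        if expand_fn is not None:
            for follow_up in expand_fn(node.question, child_answers):
                leaf = QuestionNode(follow_up)
                leaf.answer = answer_fn(follow_up)
                node.children.append(leaf)
                child_answers.append(leaf.answer)
        node.answer = aggregate_fn(node.question, child_answers)
    return node.answer

# Deterministic toy stand-ins for the LLM calls (illustration only).
def toy_decompose(q: str) -> List[str]:
    return [f"{q} / sub{i}" for i in (1, 2)]

def toy_answer(q: str) -> str:
    return f"answer({q})"

def toy_aggregate(q: str, answers: List[str]) -> str:
    return f"agg({q}: {len(answers)} answers)"

def toy_expand(q: str, answers: List[str]) -> List[str]:
    return []  # no follow-up questions triggered in this toy run

root = build_tree("Is this paper sound?", toy_decompose)
review = resolve(root, toy_answer, toy_aggregate, toy_expand)
```

In the real framework each of the three callables would be an LLM prompt, and `expand_fn` would decide from the child answers whether deeper probing is warranted; here it is a no-op so the example runs without any model.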