🤖 AI Summary
This work addresses the challenges of insufficient factual accuracy and weak rebuttal strength in large language models when automatically generating responses to peer reviews. The authors propose DEFEND, a novel system that introduces an author-in-the-loop, stepwise generation paradigm. By decomposing the rebuttal process into structured reasoning stages—including review segmentation, flaw identification, error-type annotation, and rebuttal-action mapping—DEFEND enables precise counterarguments with minimal author intervention. Experimental and user studies demonstrate that DEFEND significantly outperforms end-to-end and fully automated stepwise approaches in both factual accuracy and rebuttal effectiveness, achieving a favorable balance between automation efficiency and human controllability.
📝 Abstract
Rebuttal generation is a critical component of the peer review process for scientific papers, enabling authors to clarify misunderstandings, correct factual inaccuracies, and guide reviewers toward a more accurate evaluation. We observe that Large Language Models (LLMs) often struggle to perform targeted refutation and to maintain accurate factual grounding when used directly for rebuttal generation, highlighting the need for structured reasoning and author intervention. To address this, we introduce DEFEND, an LLM-based tool designed to explicitly execute the underlying reasoning process of automated rebuttal generation while keeping the author in the loop. Rather than writing rebuttals from scratch, the author only needs to drive the reasoning process with minimal intervention, yielding an efficient approach with little effort and low cognitive load. We compare DEFEND against three other paradigms: (i) direct rebuttal generation using an LLM (DRG), (ii) segment-wise rebuttal generation using an LLM (SWRG), and (iii) a sequential approach (SA) of segment-wise rebuttal generation without author intervention. To enable fine-grained evaluation, we extend the ReviewCritique dataset with review segmentation, deficiency and error-type annotations, rebuttal-action labels, and mappings to gold rebuttal segments. Experimental results and a user study demonstrate that directly using LLMs performs poorly in factual correctness and targeted refutation, while segment-wise generation and the automated sequential approach, together with author-in-the-loop control, substantially improve factual correctness and strength of refutation.