🤖 AI Summary
This study investigates the practical applicability of large language models (LLMs) to academic peer review and scientific discovery. Method: We propose the first multi-task, progressive evaluation framework tailored to computer science, comprising four task types—content reproduction, pairwise comparison, automated scoring, and qualitative reflection—and integrating three validation modalities: linguistic analysis, external benchmarking, and human evaluation. We further introduce role-based task design to align with distinct cognitive levels. Contribution/Results: Evaluated empirically on papers from top-tier information systems journals, the framework shows that Gemini achieves moderate performance on abstract generation and paraphrasing but exhibits insufficient discriminative power in ranking and scoring tasks; its reflective outputs demonstrate consistency yet lack substantive insight. Overall, the findings do not support unsupervised deployment of current LLMs for peer review.
📝 Abstract
How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review, is hotly debated. Their practical potential lies somewhere between a literature digest and a human-comparable research assistant. We organize tasks that computer science studies have so far employed in isolation into a guided, robust workflow for evaluating how LLMs process academic text. The assessment comprises four tasks: content reproduction, comparison, scoring, and reflection. Each demands a specific role of the LLM (oracle, judgmental arbiter, knowledgeable arbiter, collaborator) in assisting scholarly work, and together they pose questions that require progressively greater intellectual capability, culminating in a solid understanding of scientific texts. We illustrate a rigorous performance evaluation with detailed prompt instructions. Using first-rate Information Systems articles from three top journals as input texts and a rich set of text metrics, we record compromised performance from a leading LLM, Google's Gemini: its summaries and paraphrases of academic text are acceptably reliable; ranking texts through pairwise comparison scales poorly; grading academic texts yields poor discrimination; and its qualitative reflections are self-consistent yet hardly insightful enough to inspire meaningful research. This evidence against endorsing LLMs' text-processing capabilities is consistent across metric-based internal (linguistic) assessment, external comparison to the ground truth, and human evaluation, and it is robust to prompt variations. Overall, we do not recommend unchecked use of LLMs in constructing peer reviews.
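To make the role-based task design more concrete, the sketch below shows one possible way to organize the four prompts described in the abstract. The template wording, task keys, and the `build_prompt` helper are illustrative assumptions, not the study's actual prompt instructions.

```python
# Hypothetical sketch of the four role-based evaluation tasks.
# Role names follow the abstract; prompt wording and the helper
# function are assumptions for illustration, not the study's prompts.

TASK_PROMPTS = {
    # Task 1: content reproduction -- the LLM acts as an "oracle".
    "reproduction": (
        "You are an oracle with full knowledge of the paper below. "
        "Summarize and paraphrase its content faithfully.\n\n{paper}"
    ),
    # Task 2: pairwise comparison -- the LLM acts as a "judgmental arbiter".
    "comparison": (
        "You are a judgmental arbiter. Compare the two papers below and "
        "state which is stronger, with reasons.\n\nPaper A:\n{paper}\n\n"
        "Paper B:\n{other_paper}"
    ),
    # Task 3: automated scoring -- the LLM acts as a "knowledgeable arbiter".
    "scoring": (
        "You are a knowledgeable arbiter. Grade the paper below on a "
        "1-10 scale for novelty, rigor, and clarity.\n\n{paper}"
    ),
    # Task 4: qualitative reflection -- the LLM acts as a "collaborator".
    "reflection": (
        "You are a research collaborator. Reflect on the paper below and "
        "suggest directions for meaningful follow-up work.\n\n{paper}"
    ),
}


def build_prompt(task: str, paper: str, other_paper: str = "") -> str:
    """Fill in the template for one task; raises KeyError for unknown tasks."""
    return TASK_PROMPTS[task].format(paper=paper, other_paper=other_paper)


if __name__ == "__main__":
    sample = "Title: An example manuscript. Abstract: ..."
    print(build_prompt("scoring", sample)[:200])
```

A structure like this makes it easy to vary prompt wording per task while holding the input text fixed, which is the kind of prompt-robustness check the abstract reports.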