🤖 AI Summary
This study systematically evaluates the reliability of large language models (LLMs) in academic peer review, their alignment with human reviewers, and their robustness against prompt injection attacks. It builds a new benchmark of 12 LLMs assessed on 898 papers drawn by stratified sampling from NeurIPS and ICLR, and pairs it with adversarial prompt injections hidden through an invisible font-mapping attack and with analyses of review length, lexical diversity, and vocabulary standardization. The findings show that LLMs systematically overrate weaker submissions and differ from human reviewers in topical emphasis, under-flagging Clarity and over-flagging Reproducibility; their reviews are two to three times longer than human reviews but show lower lexical diversity. Stealthy prompt injections can also lift low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families.
📝 Abstract
Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.
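The abstract does not say how lexical diversity is measured. As a rough illustration only, the sketch below uses the type-token ratio (unique words divided by total words), which is one common choice and an assumption here, not necessarily the paper's metric. Because LLM reviews are reported to be two to three times longer than human ones, and the raw ratio tends to fall as text grows, the hypothetical `window` parameter truncates both texts to the same number of tokens before comparing. The function names and the tokenizer are illustrative.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(text, window=None):
    """Return unique tokens / total tokens, optionally over the first `window` tokens.

    Assumption: this is a stand-in for whatever diversity measure the paper uses.
    """
    tokens = tokenize(text)
    if window is not None:
        # Truncate so that texts of different lengths are compared fairly.
        tokens = tokens[:window]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def compare_diversity(human_review, llm_review, window=200):
    """Compute the ratio for both reviews over the same token window."""
    return {
        "human": type_token_ratio(human_review, window),
        "llm": type_token_ratio(llm_review, window),
    }
```

Calling `compare_diversity(human_text, llm_text)` returns both ratios over the same window; a lower value for the LLM review would be consistent with the reported finding, though a single pair of reviews shows nothing on its own.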