AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

📅 2026-04-15

🤖 AI Summary

This work addresses persistent challenges in peer review (variable quality, consistency, and timeliness) by deploying an AI-assisted review system at AAAI-26, the first field deployment of its kind at conference scale. The system generated one clearly labeled AI review for every main-track submission, covering all 22,977 full-review papers in less than a day, by combining frontier large language models, tool use, and multi-stage safeguards. A new benchmark shows that the system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. A large-scale survey found that authors and program committee members considered the AI reviews useful and preferred them to human reviews on dimensions such as technical accuracy and research suggestions. Together, these results indicate that current AI methods can already contribute meaningfully to peer review at conference scale.

📝 Abstract

Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.

Problem

Research questions and friction points this paper is trying to address.

peer review

AI-assisted review

scientific evaluation

large-scale deployment

review quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-assisted peer review

large-scale deployment

multi-stage review system