🤖 AI Summary
The surge in AI-generated research grant proposals has vastly outpaced human reviewers' capacity, creating an urgent need to evaluate the reliability of large language models (LLMs) in high-stakes peer review. This work proposes a structured textual perturbation framework to systematically assess LLM sensitivity across six quality dimensions and introduces a "Council of Personas" multi-agent ensemble that emulates expert review panels. Experiments show that a section-by-section review architecture significantly outperforms the alternatives in both detection accuracy and scoring reliability, while the costlier ensemble performs no better than a single-pass baseline. LLMs readily identify alignment issues but struggle to detect deficiencies in clarity, and they tend to prioritize compliance checking over holistic judgment. The study establishes an interpretable and verifiable paradigm for automating scientific proposal evaluation.
📝 Abstract
As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework probing LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a "Council of Personas" ensemble emulating expert panels. The section-level approach significantly outperforms alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than baseline. Detection varies substantially by perturbation type, with alignment issues readily identified but clarity flaws largely missed by all systems. Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and any non-protected data.
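To make the perturbation protocol concrete, here is a minimal sketch of the evaluation loop the abstract describes: inject one known flaw per quality axis, run a review architecture over the perturbed proposal, and measure the per-axis detection rate. This is illustrative only, not the authors' released code; `perturb`, `review_single_pass`, and `review_by_section` are hypothetical stand-ins for the actual flaw injection and LLM calls.

```python
from collections import defaultdict

# The six quality axes named in the abstract.
AXES = ["funding", "timeline", "competency", "alignment", "clarity", "impact"]

def perturb(proposal: str, axis: str) -> str:
    """Placeholder: inject a known flaw along one axis (e.g. an inflated
    budget line for 'funding', vague objectives for 'clarity')."""
    return proposal + f"\n[INJECTED {axis.upper()} FLAW]"

def review_single_pass(text: str) -> set[str]:
    """Stand-in for one LLM call over the whole text; in practice this
    would prompt a model and parse the axes of the flaws it reports."""
    return set()

def review_by_section(text: str) -> set[str]:
    """Stand-in for the section-level architecture: review each section
    independently, then union the flagged issues."""
    flagged: set[str] = set()
    for section in text.split("\n\n"):
        flagged |= review_single_pass(section)
    return flagged

def detection_rate(proposals: list[str], reviewer) -> dict[str, float]:
    """For each axis, perturb every proposal, review it, and record
    whether the injected flaw was among the reported issues."""
    hits: dict[str, int] = defaultdict(int)
    for proposal in proposals:
        for axis in AXES:
            if axis in reviewer(perturb(proposal, axis)):
                hits[axis] += 1
    return {axis: hits[axis] / len(proposals) for axis in AXES}

if __name__ == "__main__":
    demo = ["Objectives...\n\nBudget...\n\nTimeline..."]
    # With the stub reviewers above, all rates are 0.0; swapping in real
    # LLM-backed reviewers would reproduce the paper's comparison.
    print(detection_rate(demo, review_single_pass))
    print(detection_rate(demo, review_by_section))
```

Comparing `review_single_pass` against `review_by_section` on the same perturbed proposals is what lets the per-axis detection rates (e.g. high for alignment, low for clarity) be attributed to the review architecture rather than the prompt.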