AI Summary
Large-scale user facility proposal reviews suffer from weak inter-rater correlation, significant human bias, and poor consistency; while pairwise preference methods are theoretically sound, their manual implementation is prohibitively costly. This work introduces the first systematic application of large language models (LLMs) to high-accuracy, low-bias relative ranking of scientific proposals. We integrate LLM fine-tuning and prompt engineering (GPT-4/Claude), embedding-based proposal representation, and robust outlier detection to automate pairwise preference modeling. Our approach enables deep analyses that are infeasible for human reviewers, e.g., quantitative proposal similarity assessment. Experiments show that LLM-derived rankings achieve Spearman correlations of 0.2–0.8 with human rankings (≥ 0.5 after denoising), match domain experts in identifying high-impact proposals, reduce review costs by over two orders of magnitude, and substantially improve consistency, scalability, and fairness in proposal selection.
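As a concrete illustration of the pairwise preference modeling described above, the sketch below prompts an LLM to judge each pair of proposals and aggregates the outcomes into a ranking by win counting. This is a minimal sketch, not the paper's implementation: the `llm_prefers` wrapper and the win-count aggregation are illustrative assumptions (a Bradley-Terry fit over the same pairwise outcomes would be a natural refinement).

```python
from itertools import combinations

def llm_prefers(text_a: str, text_b: str) -> bool:
    """Hypothetical wrapper around an LLM API call (e.g., GPT-4 or Claude)
    that asks which of two proposal texts is stronger; returns True if
    the first proposal wins. The actual prompt design is an assumption."""
    raise NotImplementedError  # replace with a real API call

def rank_by_pairwise_preference(proposals: dict[str, str]) -> list[str]:
    """Rank proposals by counting pairwise wins over all n(n-1)/2 pairs --
    the quadratic workload that makes this approach infeasible for human
    reviewers but cheap for an LLM."""
    wins = {pid: 0 for pid in proposals}
    for a, b in combinations(proposals, 2):
        winner = a if llm_prefers(proposals[a], proposals[b]) else b
        wins[winner] += 1
    # Proposal IDs sorted from most to fewest pairwise wins
    return sorted(wins, key=wins.get, reverse=True)
```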
Abstract
We explore how large language models (LLMs) can enhance the proposal selection process at large user facilities, offering a scalable, consistent, and cost-effective alternative to traditional human review. Proposal selection depends on assessing the relative strength of submitted proposals; however, traditional human scoring often suffers from weak inter-reviewer correlation and is subject to reviewer bias and inconsistency. A pairwise preference-based approach is logically superior, providing a more rigorous and internally consistent basis for ranking, but its quadratic workload makes it impractical for human reviewers. We address this limitation using LLMs. Leveraging the uniquely well-curated proposals and publication records from three beamlines at the Spallation Neutron Source (SNS), Oak Ridge National Laboratory (ORNL), we show that the LLM rankings correlate strongly with the human rankings (Spearman $\rho \simeq 0.2$–$0.8$, improving to $\geq 0.5$ after 10% outlier removal). Moreover, LLM performance is no worse than that of human reviewers in identifying proposals with high publication potential, while costing over two orders of magnitude less. Beyond ranking, LLMs enable advanced analyses that are challenging for humans, such as quantitative assessment of proposal similarity via embedding models, which provides information crucial for review committees.
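The two quantitative analyses mentioned above, correlating LLM rankings against human rankings after outlier removal and measuring proposal similarity from embeddings, can be sketched as follows. This is a minimal sketch under stated assumptions: the trim-by-largest-rank-disagreement heuristic and the cosine-similarity construction are illustrative choices, not necessarily the paper's exact procedures.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_after_outlier_removal(human_ranks, llm_ranks, trim_frac=0.10):
    """Spearman rho after dropping the trim_frac of proposals whose
    human/LLM rank disagreement is largest (one simple denoising
    heuristic; the paper's outlier-detection method may differ)."""
    human = np.asarray(human_ranks, dtype=float)
    llm = np.asarray(llm_ranks, dtype=float)
    gap = np.abs(human - llm)
    keep = gap.argsort()[: int(len(gap) * (1.0 - trim_frac))]
    rho, _ = spearmanr(human[keep], llm[keep])
    return rho

def proposal_similarity(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine-similarity matrix over proposal embeddings
    (one row per proposal, from any text-embedding model); high
    off-diagonal values flag overlapping or near-duplicate proposals."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T
```

In practice, the similarity matrix can be handed to a review committee to group related submissions or spot resubmissions, the kind of quantitative cross-proposal view that is impractical to assemble by hand.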