Can ChatGPT evaluate research environments? Evidence from REF2021

📅 2025-12-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Expert evaluation of research environment statements under the UK's Research Excellence Framework (REF) is resource-intensive and prone to inter-reviewer variability. Method: This study applies large language models (LLMs) to score REF2021 environment statements across all 34 Units of Assessment, using structured prompts aligned with the official assessment criteria; ChatGPT 4o-mini was used for the main analysis, with ChatGPT 5 and Gemini Flash 2.5 added in follow-up tests on one unit (UoA34). Agreement with expert scores was quantified with Spearman's rank correlation coefficient (ρ). Contribution/Results: LLM scores showed statistically significant moderate-to-strong positive correlations with expert judgments in 32 of 34 units (peak ρ = 0.82, for ChatGPT 5 on UoA34), supporting their use as decision-support tools, though not as a replacement for human reviewers. This is the first empirical investigation of AI-assisted evaluation of research environments in a national assessment exercise, providing a methodological foundation and empirical evidence for integrating LLMs into research assessment policy and administration.

📝 Abstract
UK academic departments are evaluated partly on the statements that they write about the value of their research environments for the Research Excellence Framework (REF) periodic assessments. These statements mix qualitative narratives and quantitative data, typically requiring time-consuming and difficult expert judgements to assess. This article investigates whether Large Language Models (LLMs) can support the process or validate the results, using the UK REF2021 unit-level environment statements as a test case. Based on prompts mimicking the REF guidelines, ChatGPT 4o-mini scores correlated positively with expert scores in almost all 34 (field-based) Units of Assessment (UoAs). ChatGPT's scores had moderate to strong positive Spearman correlations with REF expert scores in 32 out of 34 UoAs: 14 UoAs above 0.7 and a further 13 between 0.6 and 0.7. Only two UoAs had weak or no significant associations (Classics and Clinical Medicine). From further tests for UoA34, multiple LLMs had significant positive correlations with REF2021 environment scores (all p < .001), with ChatGPT 5 performing best (r = 0.81; ρ = 0.82), followed by ChatGPT-4o-mini (r = 0.68; ρ = 0.67) and Gemini Flash 2.5 (r = 0.67; ρ = 0.69). If LLM-generated scores for environment statements are used in future to help reduce workload, support more consistent interpretation, and complement human review, then caution must be exercised because of the potential for biases, inaccuracy in some cases, and unwanted systemic effects. Even the strong correlations found here seem unlikely to be judged close enough to expert scores to fully delegate the assessment task to LLMs.
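The evaluation loop described in the abstract is straightforward to reproduce in outline: prompt an LLM to assign a REF-style quality score to each environment statement, then correlate the model's scores with the published expert scores. Below is a minimal sketch of that loop, not the authors' actual pipeline; the prompt wording, the `llm_score` and `evaluate` helpers, and the `statements` / `expert_scores` inputs are illustrative assumptions, while `gpt-4o-mini`, Spearman's ρ, and Pearson's r are taken from the abstract.

```python
# Minimal sketch (assumed, not the paper's code): LLM scoring of REF-style
# environment statements plus correlation with expert scores.
from openai import OpenAI
from scipy.stats import spearmanr, pearsonr

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Illustrative prompt loosely modelled on the REF2021 environment criteria
# (vitality and sustainability); the paper's exact prompts are not reproduced here.
PROMPT = (
    "You are a REF2021 panel assessor. Rate the research environment described "
    "below for vitality and sustainability on a scale from 1 (weak) to 4 "
    "(world-leading). Reply with a single number.\n\nStatement:\n{statement}"
)

def llm_score(statement: str, model: str = "gpt-4o-mini") -> float:
    """Ask the LLM for a 1-4 score and parse the numeric reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(statement=statement)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

def evaluate(statements, expert_scores, model: str = "gpt-4o-mini"):
    """Correlate LLM scores with expert scores, the paper's headline metric."""
    llm_scores = [llm_score(s, model=model) for s in statements]
    rho, p_rho = spearmanr(llm_scores, expert_scores)
    r, p_r = pearsonr(llm_scores, expert_scores)
    return {"spearman_rho": rho, "p_spearman": p_rho,
            "pearson_r": r, "p_pearson": p_r}
```

A per-UoA run would simply call `evaluate` once for each Unit of Assessment and compare the resulting ρ values against the 0.6-0.7 range reported above.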
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether LLMs can assess research environment statements.
Testing how well ChatGPT's scores correlate with expert scores in REF2021.
Investigating whether LLMs can reduce assessment workload and support consistency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using ChatGPT to score research environment statements automatically.
Comparing LLM scores with expert assessments for validation.
Applying multiple LLMs to reduce workload and improve consistency.