🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit paradigm bias in academic quality assessment due to prompt design, focusing on competing theoretical paradigms in sociology. Method: Using 17 ChatGPT instances, the authors automatically evaluated 1,490 sociology journal articles under five role-based prompting conditions—paradigm follower, paradigm opponent, antagonistic follower, antagonistic opponent, and neutral evaluator—to isolate the effect of stance on scoring behavior. Contribution/Results: The study provides the first empirical evidence that LLM-generated scores are significantly influenced by the ideological stance embedded in prompts: scores peak when the model is prompted to follow an article's paradigm, drop substantially when it is prompted to oppose it, and differ only slightly between follower and neutral prompting. These findings reveal an inherent risk of latent paradigm bias when deploying LLMs for scholarly evaluation and underscore the necessity of paradigm-neutral prompt engineering. The results offer foundational empirical support for ensuring fairness and epistemic integrity in AI-augmented academic assessment systems.
📝 Abstract
Purpose: It has become increasingly likely that Large Language Models (LLMs) will be used to score the quality of academic publications to support research assessment goals in the future. This may cause problems for fields with competing paradigms, since there is a risk that one may be favoured, causing long-term harm to the reputation of the other.

Design/methodology/approach: To test whether this is plausible, this article uses 17 ChatGPTs to evaluate up to 100 journal articles from each of eight pairs of competing sociology paradigms (1,490 altogether). Each article was assessed by prompting ChatGPT to take one of five roles: paradigm follower, opponent, antagonistic follower, antagonistic opponent, or neutral.

Findings: Articles were scored highest by ChatGPT when it followed the aligned paradigm, and lowest when it was told to devalue that paradigm and to follow the opposing one. Broadly similar patterns occurred for most of the paradigm pairs. Follower ChatGPTs displayed only a small amount of favouritism compared to neutral ChatGPTs, but articles evaluated by an opposing-paradigm ChatGPT had a substantial disadvantage.

Research limitations: The data covers a single field and a single LLM.

Practical implications: The results confirm that LLM instructions for research evaluation should be carefully designed to be paradigm-neutral, to avoid accidentally resolving conflicts between paradigms on a technicality by devaluing one side's contributions.

Originality/value: This is the first demonstration that LLMs can be prompted to show partiality for academic paradigms.