🤖 AI Summary
Accurately extracting UK Research Excellence Framework (REF) ratings (1*–4*) from noisy, unstructured text containing missing or invalid values presents a significant challenge, requiring large language models (LLMs) to output only normalized integers (1–4) or a designated missing-value indicator (−1). To address this, this work introduces the first standardized prompt engineering benchmark for complex numerical extraction tasks, accompanied by a publicly available dataset of 1,446 short texts with gold-standard annotations. By integrating semantic understanding with explicit rule-based constraints, an initial prompting strategy achieves 72.6% accuracy. The study clarifies the definition of valid ratings and formalizes a mechanism for handling missing data, thereby advancing research into LLMs’ numerical reasoning and instruction-following capabilities, with the aim of fostering community-driven improvements in structured information extraction from noisy textual sources.
📝 Abstract
In some areas of computing, natural language processing and information science, progress is made by sharing datasets and challenging the community to design the best algorithm for an associated task. This article introduces a shared dataset of 1,446 short texts, each of which describes a research quality score on the UK scale of 1* to 4*. This is a messy collection, with some texts not containing scores and others including invalid scores or strange formats. With this dataset there is also a description of what constitutes a valid score and a "gold standard" of the correct scores for these texts (including missing values). The challenge is to design a prompt for Large Language Models (LLMs) to extract the scores from these texts as accurately as possible. The format for the response should be a number and no other text, so there are two aspects to the challenge: ensuring that the LLM returns only a number, and instructing it to deduce the correct number for the text. As part of this, the LLM prompt needs to explain when to return the missing value code, -1, instead of a number when the text does not clearly contain one. The article also provides an example of a simple prompt. The purpose of the challenge is twofold: to get an effective solution to this problem, and to increase understanding of prompt design and LLM capabilities for complex numerical tasks. The initial solution suggested has an accuracy of 72.6%, so the challenge is to beat this.
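The task described above has two parts: a prompt that instructs the model to emit only a normalized score (1–4, or -1 for missing), and a harness that validates whatever the model actually returns. The sketch below illustrates one possible shape for both; the prompt wording, template name, and helper functions are illustrative assumptions, not the paper's actual prompt or code.

```python
# Illustrative sketch of a prompt template and reply validation for the
# REF score extraction challenge. The prompt text here is hypothetical,
# not the prompt from the article; accuracy depends on prompt design.

PROMPT_TEMPLATE = (
    "Read the text below and output the UK REF research quality score "
    "it describes, as a single integer from 1 to 4 (for 1* to 4*). "
    "If the text does not clearly contain a valid score, output -1. "
    "Output only the number, with no other text.\n\nText: {text}"
)

VALID_SCORES = {-1, 1, 2, 3, 4}


def build_prompt(text: str) -> str:
    """Fill the template with one of the short texts from the dataset."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_reply(reply: str) -> int:
    """Normalize an LLM reply to a valid score, or -1 if unparseable.

    The challenge asks the model to return only a number, but a robust
    harness should still guard against extra text, star suffixes, or
    out-of-range values rather than crash on them.
    """
    try:
        value = int(reply.strip().rstrip("*"))
    except ValueError:
        return -1
    return value if value in VALID_SCORES else -1
```

Scoring a prompt against the gold standard then reduces to comparing `parse_reply(model_output)` with the annotated score for each of the texts and reporting the fraction that match.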