A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models

📅 2025-04-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work systematically evaluates the zero-shot capability of large language models (LLMs) for biomedical relation extraction (RE), an area that has lacked empirical evaluation. It introduces an end-to-end zero-shot biomedical RE benchmark spanning seven datasets and, for the first time, compares GPT-4-turbo and o1 on this task across a broad range of sources. Structured output is produced in two ways: with an explicit JSON schema that constrains the shape of each relation, and with a setting in which the structure is inferred from the natural-language prompt. Experimental results show that zero-shot LLM performance approaches that of supervised fine-tuned methods. The analysis identifies two key limitations: weak performance on instances that contain many relations, and errors in the boundaries of entity mentions. All code, data, and prompt templates are publicly released, giving a reproducible evaluation framework for applying LLMs to biomedical RE.

📝 Abstract

Objective: Zero-shot methods promise to reduce the dataset annotation costs and domain expertise needed to apply NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. It is not yet clear how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sample of RE tasks.

Methods: We use OpenAI's GPT-4-turbo and its reasoning model o1 to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to produce structured output in two ways: (1) by defining an explicit schema that describes the structure of relations, and (2) by using a setting that infers the structure from the prompt language.

Results: Our work is the first to study and compare the performance of GPT-4-turbo and o1 on the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found zero-shot performance to be close to that of fine-tuned methods. The approach has two limitations: it performs poorly on instances containing many relations, and it errs on the boundaries of textual mentions.

Conclusion: Recent large language models exhibit promising zero-shot capabilities on complex biomedical RE tasks. They offer competitive performance with less dataset curation and NLP modeling effort, at the cost of increased computing, which could make these methods more accessible to the medical community. Addressing the limitations we identify could further improve reliability. The code, data, and prompts for all our experiments are publicly available: https://github.com/bionlproc/ZeroShotRE

Problem

Research questions and friction points this paper is trying to address.

Evaluates zero-shot performance of LLMs on biomedical relation extraction

Compares GPT-4-turbo and o1 across diverse RE datasets

Assesses cost-benefit tradeoffs of zero-shot RE versus fine-tuned methods
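
To make the evaluation concern concrete, the following is a minimal sketch of how end-to-end RE output is commonly scored, assuming exact matching of (head, relation, tail) triples and micro-averaging over documents. The `Relation` type, the `normalize` and `micro_prf` helpers, and the example triples are hypothetical illustrations, not the scoring code or data from the paper's repository.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A hypothetical relation triple; field names are assumptions."""
    head: str
    relation: str
    tail: str


def normalize(rel: Relation) -> Relation:
    # Case and surrounding whitespace are ignored; span boundaries are not.
    return Relation(
        rel.head.strip().lower(),
        rel.relation.strip().lower(),
        rel.tail.strip().lower(),
    )


def micro_prf(gold_docs, pred_docs):
    """Micro-averaged precision, recall, and F1 under exact triple match."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set = {normalize(r) for r in gold}
        pred_set = {normalize(r) for r in pred}
        tp += len(gold_set & pred_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One illustrative document with two gold relations (made-up examples).
gold = [[
    Relation("aspirin", "treats", "headache"),
    Relation("warfarin", "interacts_with", "aspirin"),
]]
# The second prediction has the right relation but a wider mention span.
pred = [[
    Relation("Aspirin", "treats", "headache"),
    Relation("warfarin sodium", "interacts_with", "aspirin"),
]]

precision, recall, f1 = micro_prf(gold, pred)
```

Under this assumed strict-match rule, the prediction with the wider span counts as both a false positive and a false negative, which is one way the mention-boundary errors described in the abstract can lower a score even when the relation itself is right.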

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses OpenAI GPT-4-turbo and the o1 reasoning model for zero-shot, end-to-end RE

Generates structured JSON output either from an explicit schema or from a structure inferred from the prompt

Compares performance across diverse biomedical datasets
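
As an illustration of the explicit-schema idea, here is a rough sketch of what a relation schema and a defensive parser for the model's JSON reply might look like. The schema layout, the field names (`head`, `relation`, `tail`), the relation labels, and the `parse_relations` helper are all assumptions made for this example; the paper's actual schemas and prompts are in its repository. No API call is made here: the sketch only checks a reply string after the fact.

```python
import json

# Hypothetical schema for one dataset; labels and field names are assumptions.
RELATION_SCHEMA = {
    "type": "object",
    "properties": {
        "relations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "head": {"type": "string"},
                    "relation": {
                        "type": "string",
                        "enum": ["treats", "causes", "interacts_with"],
                    },
                    "tail": {"type": "string"},
                },
                "required": ["head", "relation", "tail"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["relations"],
    "additionalProperties": False,
}


def parse_relations(raw, schema=RELATION_SCHEMA):
    """Return the (head, relation, tail) tuples in `raw` that fit the schema.

    This is a hand-rolled check for this one schema shape, not a general
    JSON Schema validator. Malformed or non-conforming items are skipped.
    """
    item = schema["properties"]["relations"]["items"]
    required = item["required"]
    allowed = set(item["properties"]["relation"]["enum"])
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    candidates = data.get("relations")
    if not isinstance(candidates, list):
        return []
    relations = []
    for cand in candidates:
        if not isinstance(cand, dict) or set(cand) != set(required):
            continue
        if not all(isinstance(cand[key], str) for key in required):
            continue
        if cand["relation"] not in allowed:
            continue
        relations.append((cand["head"], cand["relation"], cand["tail"]))
    return relations


# A made-up reply: the second item uses a label outside the schema.
reply = json.dumps({
    "relations": [
        {"head": "aspirin", "relation": "treats", "tail": "headache"},
        {"head": "aspirin", "relation": "cures", "tail": "fever"},
    ]
})
extracted = parse_relations(reply)
```

When the structure is instead inferred from the prompt, nothing constrains the keys or labels up front, so a post-hoc check of this kind is one plausible way to keep malformed items out of the evaluation.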

🔎 Similar Papers

Benchmarking large language models for biomedical natural language processing applications and recommendations