BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

📅 2025-10-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Biomedical coreference resolution faces challenges including terminological complexity, high lexical ambiguity of mentions, and long-distance dependencies. This work systematically evaluates generative large language models (LLMs) on this task and proposes four lightweight prompting strategies that combine local context enhancement with domain-specific knowledge, such as abbreviation mappings and entity dictionaries, to significantly improve performance. Experiments are conducted on the CRAFT corpus and benchmarked against discriminative models such as SpanBERT. Results show that LLaMA-8B and LLaMA-17B achieve substantial gains in precision and F1-score under entity-augmented prompting. Generative models excel at surface-form coreference identification but remain limited in handling long-distance dependencies. This study provides the first systematic validation of domain-aware prompt engineering for biomedical coreference resolution, offering a reproducible, lightweight methodology for adapting LLMs to specialized biomedical NLP tasks.

📝 Abstract

Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local context, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mention ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
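To make the idea of entity-augmented prompting concrete, here is a minimal Python sketch of how abbreviation mappings and an entity dictionary might be folded into a coreference prompt. The function name, the prompt wording, and the dictionary formats are all hypothetical; the abstract does not specify the templates used in the paper, so this should be read as one plausible construction rather than the authors' method.

```python
from typing import Dict, List


def build_entity_augmented_prompt(
    passage: str,
    mention: str,
    abbreviations: Dict[str, str],
    entity_dictionary: Dict[str, str],
) -> str:
    """Assemble a coreference prompt enriched with domain cues.

    The prompt layout is an illustrative assumption, not the paper's template.
    """
    lowered = passage.lower()
    # Keep only cues that occur in the passage. Plain substring matching is
    # used for brevity; it can over-match inside longer words.
    abbr_lines = [
        f"- {short} = {long_form}"
        for short, long_form in sorted(abbreviations.items())
        if short.lower() in lowered
    ]
    entity_lines = [
        f"- {name}: {entity_type}"
        for name, entity_type in sorted(entity_dictionary.items())
        if name.lower() in lowered
    ]
    sections: List[str] = [
        "Task: list every expression in the passage that refers to the "
        "same entity as the target mention.",
        f"Passage: {passage}",
        f"Target mention: {mention}",
    ]
    # Domain-grounding sections are added only when they have content.
    if abbr_lines:
        sections.append("Abbreviations:\n" + "\n".join(abbr_lines))
    if entity_lines:
        sections.append("Known entities:\n" + "\n".join(entity_lines))
    return "\n\n".join(sections)
```

Filtering the cues down to those present in the passage keeps the prompt short, which fits the lightweight framing of the prompting experiments. The returned string would then be sent to whichever LLM is under evaluation.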

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on the challenges of biomedical coreference resolution

Assessing how domain-specific prompts cope with complex terminology and mention ambiguity

Comparing generative versus discriminative coreference resolution methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating LLMs with domain-specific prompting techniques

Using entity-augmented prompts to enhance biomedical coreference

Benchmarking generative against discriminative coreference resolution methods
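Comparing a generative and a discriminative system requires scoring both outputs against the same gold clusters. The sketch below computes precision, recall, and F1 over pairwise coreference links. This simple link-based score is an assumption chosen for illustration; the summary does not say which coreference metric the paper reports, and the function names and the use of plain strings as mention identifiers are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def cluster_links(clusters: Iterable[List[str]]) -> Set[Tuple[str, str]]:
    """Expand clusters into the set of unordered mention pairs they imply."""
    links: Set[Tuple[str, str]] = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical order so (a, b) == (b, a).
        for first, second in combinations(sorted(set(cluster)), 2):
            links.add((first, second))
    return links


def pairwise_prf(
    gold: Iterable[List[str]], predicted: Iterable[List[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over coreference links."""
    gold_links = cluster_links(gold)
    predicted_links = cluster_links(predicted)
    correct = len(gold_links & predicted_links)
    precision = correct / len(predicted_links) if predicted_links else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Mention identifiers are plain strings here; a real pipeline would use
# document offsets so that repeated surface forms stay distinct.
gold = [["BRCA1 gene", "it", "the gene"], ["mice", "they"]]
predicted = [["BRCA1 gene", "it"], ["mice", "they", "the gene"]]
precision, recall, f1 = pairwise_prf(gold, predicted)
```

In this toy case the prediction attaches "the gene" to the wrong cluster, so only two of the four predicted links and two of the four gold links match.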

🔎 Similar Papers

Benchmarking large language models for biomedical natural language processing applications and recommendations