Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

📅 2026-05-27
🤖 AI Summary
This study addresses a key bottleneck in phenotype annotation, namely its reliance on manual curation and its limited scalability, by deploying frontier large language models (LLMs) as "agentic curators" that map free-text phenotype descriptions to standardized ontology terms (UBERON, PATO, BSPO, GO) within a self-contained workspace. Each workspace supplies the source publication as a PDF, the annotation guide used by the original human curators, the four ontologies, and a validation script; five hosted LLMs from Anthropic and OpenAI perform the annotation end to end. Evaluated against the Gold Standard benchmark of Dahdul et al. (2018), every agent fell within the range of inter-curator variability among the three human curators, although the best agents approached but did not reach the best human curator. The agents substantially outperformed the Semantic CharaParser NLP tool on all four metrics, which indicates that LLM-based ontology annotation is feasible.
📝 Abstract
Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study integration of comparative morphological data. This labor-intensive process has relied heavily on highly trained human experts, which makes it difficult to scale and thus a key bottleneck. Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three human curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics; they reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Here we revisit that benchmark with five frontier hosted LLMs from Anthropic and OpenAI, each operating as an "agentic curator" within a self-contained workspace that supplies the source publication PDF, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script. Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability of the three trained human biocurators of the original study; the best-performing agents approached but did not reach the best-performing human curator. Agents substantially outperformed Semantic CharaParser on all four metrics.
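To make "ontology-based semantic similarity" between EQ annotations concrete, here is a minimal sketch. The abstract does not name the four metrics, so the Jaccard similarity over ancestor sets used here is an assumption chosen for illustration, not necessarily one of the measures in the study. The toy hierarchy, the term labels, and the names `EQ`, `ancestors`, `jaccard`, and `eq_similarity` are all hypothetical and do not come from UBERON, PATO, or the paper's code.

```python
from dataclasses import dataclass

# Toy is_a hierarchy (child -> parents). Labels are illustrative stand-ins,
# not real UBERON or PATO terms or identifiers.
TOY_PARENTS = {
    "pectoral fin": ["paired fin"],
    "pelvic fin": ["paired fin"],
    "paired fin": ["fin"],
    "fin": ["anatomical structure"],
    "anatomical structure": [],
    "elongated": ["shape"],
    "curved": ["shape"],
    "shape": ["quality"],
    "quality": [],
}


@dataclass(frozen=True)
class EQ:
    """A simplified Entity-Quality annotation: one entity, one quality."""
    entity: str
    quality: str


def ancestors(term, parents=TOY_PARENTS):
    """Return the term together with all of its is_a ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def jaccard(term_a, term_b):
    """Jaccard similarity of the two terms' ancestor sets (assumed metric)."""
    set_a, set_b = ancestors(term_a), ancestors(term_b)
    return len(set_a & set_b) / len(set_a | set_b)


def eq_similarity(x, y):
    """Average the entity and quality similarities of two EQ annotations."""
    return (jaccard(x.entity, y.entity) + jaccard(x.quality, y.quality)) / 2


gold = EQ(entity="pectoral fin", quality="elongated")
candidate = EQ(entity="pelvic fin", quality="elongated")
score = eq_similarity(gold, candidate)
print(f"{score:.2f}")
```

A score like this gives partial credit when a curator or agent picks a term close to the Gold Standard term in the hierarchy, rather than scoring only exact matches. That is the general motivation for similarity-based evaluation of annotations.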
Problem

Research questions and friction points this paper is trying to address.

phenotype annotation
ontology curation
natural language processing
biocuration
comparative morphology
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based agents
phenotype annotation
ontology curation
Entity-Quality (EQ) framework
semantic similarity