FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for scientific knowledge base construction focus on isolated subtasks such as named entity recognition or relation extraction, which are insufficient for evaluating the holistic capabilities of AI systems in end-to-end ontology building. This work proposes FlyBench, an agent-based benchmark for end-to-end ontology curation: given only a gene symbol, models must generate structured annotations covering gene function, expression patterns, and historical synonyms from a corpus of 16,898 Drosophila literature articles, evaluated against 7,397 expert-curated FlyBase annotations. Comparative analysis of memorization, fixed-pipeline, single-agent, and multi-agent architectures reveals that multi-agent systems significantly outperform simpler baselines, yet overall performance remains far from practical utility. The study further uncovers diminishing returns from scaling backbone models and shows that agents use retrieved evidence primarily to verify parametric knowledge rather than to discover new information.

📝 Abstract
Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
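The evaluation the abstract describes, comparing an agent's structured annotations for a gene against FlyBase's expert-curated gold set, can be sketched as set-level precision/recall/F1. This is a minimal illustration only: the annotation tuples `(gene, aspect, term)` and the exact-match metric are assumptions for clarity, not FlyBench's actual data model or scoring protocol.

```python
# Hypothetical sketch: score one gene's predicted annotations against the
# expert-curated gold set using exact-match precision, recall, and F1.
# The tuple layout (gene, aspect, term) is illustrative, not FlyBench's schema.

def score_annotations(predicted, gold):
    """Return (precision, recall, f1) over exact-match annotation tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # annotations the agent got exactly right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: two of three predictions match a gold set of four annotations.
gold = {("Adh", "function", "GO:0004022"),
        ("Adh", "function", "GO:0006066"),
        ("Adh", "expression", "larval fat body"),
        ("Adh", "synonym", "Alcohol dehydrogenase")}
pred = {("Adh", "function", "GO:0004022"),
        ("Adh", "expression", "larval fat body"),
        ("Adh", "function", "GO:9999999")}  # a spurious prediction
p, r, f1 = score_annotations(pred, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Aggregating such per-gene scores across the benchmark's 100 genes would give a corpus-level figure; how FlyBench actually aggregates (micro vs. macro, partial ontology-aware credit, etc.) is not specified here.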
Problem

Research questions and friction points this paper is trying to address.

ontology curation
scientific knowledge bases
agentic AI
Drosophila
structured annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic ontology curation
scientific knowledge base
retrieval-augmented reasoning
multi-agent system
FlyBench