FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for scientific knowledge base construction focus on isolated subtasks such as named entity recognition or relation extraction, which are insufficient for evaluating the holistic capabilities of AI systems in end-to-end ontology building. This work proposes FlyBench, an agent-based benchmark for end-to-end ontology curation: given only a gene symbol, models must generate structured annotations covering gene function, expression patterns, and historical synonyms from a corpus of 16,898 Drosophila literature articles, evaluated against 7,397 expert-curated FlyBase annotations. Comparative analysis of memorization, fixed-pipeline, single-agent, and multi-agent architectures reveals that multi-agent systems significantly outperform simpler baselines, yet overall performance remains far from practical utility. The study further uncovers diminishing returns from scaling backbone models and shows that agents use retrieved evidence primarily to verify parametric knowledge rather than to discover new information.

📝 Abstract
Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
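The evaluation the abstract describes, comparing an agent's structured annotations for a gene against FlyBase's expert-curated gold set, can be sketched as set-level precision/recall/F1. This is a minimal illustration only: the annotation tuples `(gene, aspect, term)` and the exact-match metric are assumptions for clarity, not FlyBench's actual data model or scoring protocol.

```python
# Hypothetical sketch: score one gene's predicted annotations against the
# expert-curated gold set using exact-match precision, recall, and F1.
# The tuple layout (gene, aspect, term) is illustrative, not FlyBench's schema.

def score_annotations(predicted, gold):
    """Return (precision, recall, f1) over exact-match annotation tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # annotations the agent got exactly right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: two of three predictions match a gold set of four annotations.
gold = {("Adh", "function", "GO:0004022"),
        ("Adh", "function", "GO:0006066"),
        ("Adh", "expression", "larval fat body"),
        ("Adh", "synonym", "Alcohol dehydrogenase")}
pred = {("Adh", "function", "GO:0004022"),
        ("Adh", "expression", "larval fat body"),
        ("Adh", "function", "GO:9999999")}  # a spurious prediction
p, r, f1 = score_annotations(pred, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Aggregating such per-gene scores across the benchmark's 100 genes would give a corpus-level figure; how FlyBench actually aggregates (micro vs. macro, partial ontology-aware credit, etc.) is not specified here.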
Problem

Research questions and friction points this paper is trying to address.

ontology curation
scientific knowledge bases
agentic AI
Drosophila
structured annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic ontology curation
scientific knowledge base
retrieval-augmented reasoning
multi-agent system
FlyBench