BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

📅 2026-04-17

🤖 AI Summary

No unified, comprehensive benchmark currently exists for evaluating the animal-domain knowledge of large language models under closed-book conditions, that is, without external retrieval. To address this gap, this work proposes BAGEL, the first fine-grained closed-book question-answering benchmark designed specifically for animal knowledge. BAGEL draws on heterogeneous data from bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, and spans dimensions such as taxonomy, morphology, habitat, and behavior. By combining manual curation with automated question generation, it supports fine-grained analysis across knowledge categories, taxonomic groups, and data sources. The benchmark is designed to expose the strengths and limitations of language models in biodiversity-related knowledge, and it offers a new platform for assessing domain-specific generalization and the reliability of downstream applications.

📝 Abstract

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well they handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. Because evaluation is closed-book, BAGEL measures the animal-related knowledge that models hold internally, without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
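
The fine-grained analysis described above (by source domain, taxonomic group, and knowledge category) can be pictured as a grouped accuracy computation over closed-book answers. The following is a minimal sketch under stated assumptions, not BAGEL's actual evaluation code. The record fields (`source`, `taxon`, `category`, `gold`, `pred`), the toy records, and the normalized exact-match scoring rule are all hypothetical and chosen only for illustration. The abstract does not specify the benchmark's schema or metric.

```python
from collections import defaultdict

# Toy records invented for illustration only. The field names are assumptions,
# not BAGEL's real schema, and the answers are not benchmark data.
records = [
    {"source": "Wikipedia", "taxon": "Aves", "category": "habitat",
     "gold": "wetlands", "pred": " Wetlands "},
    {"source": "Xeno-canto", "taxon": "Aves", "category": "vocalization",
     "gold": "trill", "pred": "whistle"},
    {"source": "Global Biotic Interactions", "taxon": "Mammalia",
     "category": "species interactions", "gold": "preys on", "pred": "preys on"},
    {"source": "Wikipedia", "taxon": "Mammalia", "category": "taxonomy",
     "gold": "felidae", "pred": "Canidae"},
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def accuracy_by(records, key):
    """Return exact-match accuracy per distinct value of `key` (an assumed scoring rule)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if normalize(record["pred"]) == normalize(record["gold"]):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Break the same predictions down along each analysis axis.
for axis in ("source", "taxon", "category"):
    print(axis, accuracy_by(records, axis))
```

Slicing one set of predictions along several axes is what lets a benchmark of this kind separate a weakness tied to a data source from one tied to a taxonomic group or a knowledge category.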

Problem

Research questions and friction points this paper is trying to address.

animal knowledge

language models

closed-book evaluation

domain-specific knowledge

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

closed-book evaluation

animal knowledge benchmark

domain-specific knowledge