AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models

📅 2025-07-29
🤖 AI Summary
The absence of high-quality Chinese evaluation benchmarks in agriculture hinders rigorous assessment and optimization of large language models (LLMs) for domain-specific applications. Method: We introduce AgriEval, the first comprehensive Chinese agricultural evaluation benchmark, covering six major categories and 29 subcategories. It comprises 14,697 multiple-choice questions and 2,167 open-ended questions, systematically evaluating LLMs across four cognitive dimensions: memorization, comprehension, reasoning, and generation. AgriEval is manually curated from authoritative higher-education agricultural curricula, ensuring high quality, multi-format support, and scale. Contribution/Results: A systematic evaluation of 51 open-source and commercial LLMs reveals consistently low average accuracy (below 60%), highlighting substantial challenges in agricultural language modeling. AgriEval fills a critical gap in Chinese agricultural LLM evaluation, providing both a standardized benchmark and an interpretable performance attribution framework to guide future research and development.

📝 Abstract
In the agricultural domain, the deployment of large language models (LLMs) is hindered by the lack of training data and evaluation benchmarks. To mitigate this issue, we propose AgriEval, the first comprehensive Chinese agricultural benchmark, with three main characteristics: (1) Comprehensive Capability Evaluation. AgriEval covers six major agricultural categories and 29 subcategories, addressing four core cognitive scenarios: memorization, understanding, inference, and generation. (2) High-Quality Data. The dataset is curated from university-level examinations and assignments, providing a natural and robust benchmark for assessing the capacity of LLMs to apply knowledge and make expert-like decisions. (3) Diverse Formats and Extensive Scale. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended questions, making it the most extensive agricultural benchmark available to date. We also present comprehensive experimental results for 51 open-source and commercial LLMs. The results reveal that most existing LLMs struggle to reach 60% accuracy, underscoring the developmental potential of agricultural LLMs. Additionally, we conduct extensive experiments to investigate the factors influencing model performance and propose strategies for enhancement. AgriEval is available at https://github.com/YanPioneer/AgriEval/.
Problem

Research questions and friction points this paper is trying to address.

Lack of agricultural training data for LLMs
Absence of evaluation benchmarks in agriculture
Low accuracy of current LLMs in agriculture
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive Chinese agricultural benchmark
High-quality data from university-level sources
Diverse formats with extensive question scale
👥 Authors
Lian Yan — Harbin Institute of Technology (Large Language Model, Dialogue System for Medical Diagnosis)
Haotian Wang — Harbin Institute of Technology
Chen Tang — MemTensor (Shanghai) Technology Co., Ltd.
Haifeng Liu — Zhejiang University (Machine Learning, Data Management, Information Retrieval)
Tianyang Sun — Harbin Institute of Technology
Liangliang Liu — Harbin Institute of Technology
Yi Guan — Harbin Institute of Technology
Jingchi Jiang — Harbin Institute of Technology (Knowledge Graph, Machine Learning, Data Mining)