🤖 AI Summary
Agricultural decision-making critically depends on fine-grained contextual knowledge spanning geography, climate, and economics, yet conventional large language models (LLMs) lack the structured reasoning capabilities that such domain-specific tasks require.
Method: We propose an agricultural large reasoning model (LRM) paradigm, introducing (i) AgReason, the first expert-curated open benchmark for agricultural scientific reasoning (100 questions), and (ii) AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and paired with synthetically generated reasoning traces. Leveraging foundation models (e.g., Gemini), we systematically evaluate reasoning capability and apply supervised fine-tuning with reasoning-trace distillation to train AgThinker, a suite of small, deployable reasoning models.
Contribution/Results: The strongest Gemini baseline achieves only 36% accuracy on AgReason, and LRMs outperform conventional LLMs overall; AgThinker runs efficiently on consumer-grade GPUs, and training on AgThoughts proves effective at unlocking agricultural reasoning abilities in LLMs. This work establishes the first comprehensive evaluation and modeling framework for agricultural reasoning, providing evidence that domain-specific reasoning traces substantially improve LLMs' agricultural reasoning ability.
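The distillation step described above trains a small model on question-answer pairs augmented with reasoning traces. As a rough illustration, a minimal sketch of how one such training example might be serialized for supervised fine-tuning is shown below; the field names, `<think>` tags, and the sample question are illustrative assumptions, not the paper's actual AgThoughts schema.

```python
# Hypothetical sketch: formatting one AgThoughts-style example for
# reasoning-trace distillation. The layout (trace wrapped in <think>
# tags before the final answer) is a common SFT convention, assumed
# here rather than taken from the paper.

def format_sft_example(question: str, reasoning_trace: str, answer: str) -> str:
    """Serialize a QA pair and its reasoning trace into one training
    string, so the student model learns to emit its chain of thought
    before the final answer."""
    return (
        f"Question: {question}\n"
        f"<think>{reasoning_trace}</think>\n"
        f"Answer: {answer}"
    )

# Illustrative agricultural example (content invented for this sketch).
example = format_sft_example(
    question="Which cover crop suits a short fall planting window in central Iowa?",
    reasoning_trace=(
        "Frost arrives by mid-October, so the species must establish "
        "quickly in cool soils; cereal rye germinates at low temperatures "
        "and overwinters reliably."
    ),
    answer="Cereal rye",
)
print(example)
```

A corpus of such strings can then be fed to any standard causal-LM fine-tuning loop; at inference time the `<think>` span can be stripped to show users only the final answer.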
📝 Abstract
Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating such nuanced problems due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is available at: https://baskargroup.github.io/Ag_reasoning/