🤖 AI Summary
The agricultural remote sensing (RS) community lacks a comprehensive, large-scale multimodal evaluation benchmark tailored for Large Multimodal Models (LMMs), suffering from narrow application scenarios, coarse-grained tasks, and insufficient coverage of cognitive dimensions. Method: We propose AgroMind—the first agricultural RS-specific multimodal benchmark—spanning four cognitive dimensions: spatial perception, object understanding, scene understanding, and reasoning, comprising 13 fine-grained tasks, 25,026 question-answer pairs, and 15,556 multi-source RS images. We systematically design a domain-specific, multi-granularity cognitive task taxonomy, develop an automated, agriculture-aware question-generation pipeline, and establish a unified evaluation framework covering 18 open-source and 3 closed-source LMMs. Results: Experiments reveal that current LMMs significantly underperform humans in spatial reasoning and fine-grained identification, yet surpass human accuracy in crop classification. AgroMind provides the first reproducible, extensible evaluation standard for multimodal models in agricultural RS.
📝 Abstract
Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily insufficient scene diversity and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning. It comprises 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 25,026 QA pairs and 15,556 images. The pipeline begins with multi-source data preprocessing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through systematic task definition. Finally, we employ LMMs to perform inference, generate responses, and conduct detailed examinations. We evaluate 18 open-source LMMs and 3 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition; notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.
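The abstract describes an evaluation flow of question generation followed by LMM inference and scoring. The snippet below is a minimal illustrative sketch of such a multiple-choice QA evaluation loop, not the authors' actual harness: the file name `agromind_qa.json`, the field names (`image`, `question`, `options`, `answer`, `task`), and the `query_lmm` placeholder are all assumptions for illustration.

```python
# Illustrative sketch only: AgroMind's real data schema, file layout, and scoring
# rules are not specified here; all names below are hypothetical placeholders.
import json
from collections import defaultdict


def query_lmm(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a call to an open- or closed-source LMM.

    Expected to return the letter of the chosen option, e.g. 'A'.
    """
    raise NotImplementedError("plug in your model client here")


def evaluate(qa_file: str) -> dict[str, float]:
    """Run multiple-choice QA over the benchmark file and report per-task accuracy."""
    with open(qa_file, "r", encoding="utf-8") as f:
        qa_pairs = json.load(f)  # assumed: a list of dicts, one per QA pair

    correct, total = defaultdict(int), defaultdict(int)
    for item in qa_pairs:
        pred = query_lmm(item["image"], item["question"], item["options"])
        total[item["task"]] += 1
        if pred.strip().upper() == item["answer"].strip().upper():
            correct[item["task"]] += 1

    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    scores = evaluate("agromind_qa.json")  # hypothetical file name
    for task, acc in sorted(scores.items()):
        print(f"{task}: {acc:.3f}")
```

Per-task accuracy as sketched here is one plausible way to surface the gaps the paper reports (e.g., weaker spatial reasoning and fine-grained recognition); the benchmark's official metrics may differ.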