🤖 AI Summary
The agricultural remote sensing (RS) community lacks a comprehensive, large-scale multimodal evaluation benchmark tailored for Large Multimodal Models (LMMs), suffering from narrow application scenarios, coarse-grained tasks, and insufficient coverage of cognitive dimensions. Method: We propose AgroMind—the first agricultural RS-specific multimodal benchmark—spanning four cognitive dimensions: spatial perception, object understanding, scene understanding, and reasoning, comprising 13 fine-grained tasks, 25,026 question-answer pairs, and 15,556 multi-source RS images. We systematically design a domain-specific, multi-granularity cognitive task taxonomy, develop an automated, agriculture-aware question-generation pipeline, and establish a unified evaluation framework covering 18 open-source and 3 closed-source LMMs. Results: Experiments reveal that current LMMs significantly underperform humans in spatial reasoning and fine-grained identification, yet surpass human accuracy in crop classification. AgroMind provides the first reproducible, extensible evaluation standard for multimodal models in agricultural RS.
📝 Abstract
Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily insufficient scene diversity and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning. It comprises 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 25,026 QA pairs and 15,556 images. The pipeline begins with multi-source data preprocessing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through systematic task definition. Finally, we employ LMMs to perform inference, generate responses, and conduct detailed examinations. We evaluate 18 open-source LMMs and 3 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition; notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.
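The abstract describes an evaluation flow of question generation followed by LMM inference and scoring. The snippet below is a minimal illustrative sketch of such a multiple-choice QA evaluation loop, not the authors' actual harness: the file name `agromind_qa.json`, the field names (`image`, `question`, `options`, `answer`, `task`), and the `query_lmm` placeholder are all assumptions for illustration.

```python
# Illustrative sketch only: AgroMind's real data schema, file layout, and scoring
# rules are not specified here; all names below are hypothetical placeholders.
import json
from collections import defaultdict


def query_lmm(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a call to an open- or closed-source LMM.

    Expected to return the letter of the chosen option, e.g. 'A'.
    """
    raise NotImplementedError("plug in your model client here")


def evaluate(qa_file: str) -> dict[str, float]:
    """Run multiple-choice QA over the benchmark file and report per-task accuracy."""
    with open(qa_file, "r", encoding="utf-8") as f:
        qa_pairs = json.load(f)  # assumed: a list of dicts, one per QA pair

    correct, total = defaultdict(int), defaultdict(int)
    for item in qa_pairs:
        pred = query_lmm(item["image"], item["question"], item["options"])
        total[item["task"]] += 1
        if pred.strip().upper() == item["answer"].strip().upper():
            correct[item["task"]] += 1

    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    scores = evaluate("agromind_qa.json")  # hypothetical file name
    for task, acc in sorted(scores.items()):
        print(f"{task}: {acc:.3f}")
```

Per-task accuracy as sketched here is one plausible way to surface the gaps the paper reports (e.g., weaker spatial reasoning and fine-grained recognition); the benchmark's official metrics may differ.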