AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of agricultural visual grounding, where targets are often small, repetitive, occluded, and associated with ambiguous linguistic references, and where a unified evaluation benchmark has been lacking. The authors propose AgroVG, a generalized visual grounding benchmark tailored to agricultural scenarios that formalizes the task as set prediction: given an image and a referring expression, a model must output all matching instances or explicitly abstain. AgroVG integrates 10 data sources into 10,071 image-query pairs covering six target families and three query regimes (single-target, multi-target, and target-absent), and supports both bounding-box (T1) and instance-mask (T2) grounding. It provides set-level evaluation protocols that also score existence-aware abstention. Zero-shot evaluation of 26 model configurations reveals limited performance: the best multi-target Set-F₁ is only 0.35, and the best mask success rate on positive queries at IoU@0.75 stays below 0.17, underscoring the difficulty of the task.

📝 Abstract

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F₁ reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/ .
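
To make the set-prediction framing concrete, the sketch below shows one plausible way to score a single query at the set level. It is not the AgroVG protocol: the abstract does not specify the matching rule or thresholds, so the greedy one-to-one matching, the 0.5 IoU threshold, the scoring of target-absent queries (1.0 for abstaining, 0.0 otherwise), and the names `box_iou` and `set_f1` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def set_f1(pred: List[Box], gt: List[Box], iou_thr: float = 0.5) -> float:
    """Hypothetical per-query set-level F1 (not the paper's exact metric)."""
    # Target-absent query: the only correct output is to abstain.
    if not gt:
        return 1.0 if not pred else 0.0
    # Abstaining when targets exist recovers nothing.
    if not pred:
        return 0.0

    # Greedy one-to-one matching: take candidate pairs in descending IoU order.
    pairs = sorted(
        ((box_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_thr:
            break  # all remaining pairs are below the threshold
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1

    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a model that finds one of two targets with no false positives has precision 1.0 and recall 0.5, giving an F1 of about 0.67. This illustrates why multi-target queries penalize incomplete target sets even when every returned box is accurate.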

Problem

Research questions and friction points this paper is trying to address.

visual grounding

agricultural AI

benchmark

object localization

set prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual grounding

agricultural AI

generalized set prediction