Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited zero-shot object detection performance of vision foundation models in agriculture, which the authors attribute to suboptimal text prompt construction and the absence of a systematic optimization strategy. The authors propose a framework that decomposes prompts along eight axes and recombines them, and use it to evaluate four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, and OWLv2) on synthetic and real-world images of cowpea flowers and pods. One-factor-at-a-time analysis, combinatorial optimization, and LLM-assisted translation of the prompt structure from flowers to pods show that the best prompt combinations are model-specific and non-obvious, yet transfer across domains. Model-specific prompts improve mAP@0.5 over a naive species-name baseline by +0.357 for YOLO World and +0.362 for OWLv2 on synthetic cowpea flower data. Prompts optimized solely on synthetic data match or exceed those tuned on labeled real data for the majority of model-object combinations, indicating that synthetic-only prompt optimization transfers to real agricultural settings.

📝 Abstract

Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors -- YOLO World, SAM3, Grounding DINO, and OWLv2 -- for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target -- cowpea pods -- and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.
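The two-stage search described in the abstract (one-factor-at-a-time analysis, then combinatorial optimization over the prompt axes) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not list the eight axes, so the three axes and their options are hypothetical, and `toy_score` is a hand-made stand-in for running a detector and computing mAP@0.5. It includes one deliberate interaction between axes to show why the best per-axis choices need not combine into the best prompt.

```python
from itertools import product

# Hypothetical axes and options (the paper's actual eight axes are not listed here).
AXES = {
    "name": ["cowpea flower", "flower"],
    "color": ["", "yellow"],
    "context": ["", "in a field"],
}

def build_prompt(choice):
    """Join the non-empty axis values into a single text prompt."""
    parts = [choice["color"], choice["name"], choice["context"]]
    return " ".join(p for p in parts if p)

def toy_score(choice):
    """Stand-in for detector mAP@0.5; values are invented, not from the paper."""
    s = 0.1
    if choice["name"] == "flower":
        s += 0.1
    if choice["color"] == "yellow":
        s += 0.2
    if choice["context"] == "in a field":
        s += 0.05
    # Invented interaction: color and context together hurt this "model".
    if choice["color"] == "yellow" and choice["context"] == "in a field":
        s -= 0.15
    return s

def ofat(axes, baseline, score):
    """Vary one axis at a time while holding the others at the baseline."""
    return {
        axis: {opt: score({**baseline, axis: opt}) for opt in options}
        for axis, options in axes.items()
    }

def top_options(ofat_results, k):
    """Keep the k best-scoring options per axis as a shortlist."""
    return {
        axis: sorted(scores, key=scores.get, reverse=True)[:k]
        for axis, scores in ofat_results.items()
    }

def combinatorial_search(shortlist, score):
    """Score every combination of shortlisted options and return the best."""
    best_choice, best_score = None, float("-inf")
    for combo in product(*shortlist.values()):
        choice = dict(zip(shortlist.keys(), combo))
        s = score(choice)
        if s > best_score:
            best_choice, best_score = choice, s
    return best_choice, best_score

baseline = {"name": "cowpea flower", "color": "", "context": ""}
per_axis = ofat(AXES, baseline, toy_score)
best_choice, best_score = combinatorial_search(top_options(per_axis, 2), toy_score)
print(build_prompt(best_choice), round(best_score, 3))
```

In this toy setup the one-factor-at-a-time pass favors adding "in a field" when considered alone, but the combinatorial pass drops it because of the interaction with the color term. In practice the scoring function would be replaced by an evaluation of each detector on a labeled (synthetic or real) image set, run separately per model since the abstract reports that models respond divergently to the same prompt structure.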

Problem

Research questions and friction points this paper is trying to address.

vision foundation models

zero-shot object detection

prompt engineering

agricultural scenes

open-vocabulary detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt engineering

vision foundation models

zero-shot object detection