CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis

📅 2026-05-04

🤖 AI Summary

This study addresses the limitations of conventional closed-set vision systems in plant phenotyping, which struggle with species diversity and high annotation costs and cannot recognize novel crops emerging in breeding programs. To overcome these challenges, the authors propose CropVLM, the first open-set vision-language model tailored for agriculture. The model is adapted to the domain through Domain-Specific Semantic Alignment (DSSA) and is integrated into a Hybrid Open-Set Localization Network (HOS-Net) for detection. Together they enable zero-shot detection of previously unseen crops from natural language descriptions alone, without retraining or species-specific annotations. Trained on 52,987 field image-text pairs covering 37 species, CropVLM achieves 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines. The detection pipeline reaches AP50 scores of 49.17 on the CVTCropDet benchmark and 50.73 on the tropical fruit dataset, compared with 34.89 and 48.58 for the next-best method, respectively.
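To make the zero-shot classification step concrete, the sketch below shows how a CLIP-style model typically picks a label: the image embedding is compared against one text embedding per candidate description by cosine similarity, and the scores are turned into probabilities with a temperature-scaled softmax. This is a generic illustration, not the CropVLM implementation. The function names, the toy 3-dimensional embeddings, the labels, and the temperature value are all assumptions made for the example; in practice the embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    """Return the best-matching label and the per-label probabilities.

    image_emb: embedding of one image (hypothetical, from an image encoder)
    text_embs: one embedding per candidate description (hypothetical, from a text encoder)
    temperature: assumed softmax temperature; smaller values sharpen the distribution
    """
    img = l2_normalize(image_emb)
    sims = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        sims.append(sum(a * b for a, b in zip(img, txt)))
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy embeddings standing in for encoder outputs (illustrative values only).
labels = ["a photo of wheat", "a photo of maize", "a photo of mango"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

predicted, probs = zero_shot_classify(image_emb, text_embs, labels)
print(predicted)
```

Because the label set is just a list of text prompts, adding a new crop under this scheme means adding one more description and its embedding, with no retraining, which is the property the summary highlights.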

📝 Abstract

High-throughput plant phenotyping, the quantitative measurement of observable plant traits, is critical for modern breeding but remains constrained by a "phenotyping bottleneck," where manual data collection is labor-intensive and prone to observer bias. Conventional closed-set computer vision systems fail to address this challenge, as they require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations. To bridge this gap, we present CropVLM, a Vision-Language Model (VLM) adapted for the agricultural domain via Domain-Specific Semantic Alignment (DSSA). Trained on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions, CropVLM effectively maps agronomic terminology to fine-grained visual features. We further introduce the Hybrid Open-Set Localization Network (HOS-Net), an architecture that integrates CropVLM to enable the detection of novel crops solely from natural language descriptions without retraining. By eliminating the reliance on species-specific training data, CropVLM provides a scalable solution for high-throughput phenotyping, accelerating genetic gain and facilitating large-scale biodiversity research essential for sustainable agriculture. The trained model weights and complete pipeline implementation are publicly available at: [https://github.com/boudiafA/CropVLM](https://github.com/boudiafA/CropVLM). In comprehensive evaluations, CropVLM achieves 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines. Our detection pipeline demonstrates superior zero-shot generalization to novel species, achieving 49.17 AP50 on our CVTCropDet benchmark and 50.73 AP50 on tropical fruit species, compared to 34.89 and 48.58 for the next-best method, respectively.
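For readers unfamiliar with the AP50 metric quoted above, the following is a minimal sketch of how average precision at an IoU threshold of 0.5 can be computed for a single class in a single image. It is a simplified illustration rather than the evaluation code used in the paper: real benchmarks aggregate over many images and classes, and the function names, box format `(x1, y1, x2, y2)`, and toy boxes here are assumptions. The function returns a value in [0, 1], whereas the scores above are reported on a 0 to 100 scale.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision_50(preds, gts, thr=0.5):
    """AP at a fixed IoU threshold for one class.

    preds: list of (score, box) predictions
    gts:   list of ground-truth boxes
    """
    if not gts:
        return 0.0
    # Process predictions from most to least confident.
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    matched = set()
    tp_flags = []
    for _, box in preds:
        # Greedily match to the best still-unmatched ground-truth box.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_iou >= thr:
            matched.add(best_j)
            tp_flags.append(1)
        else:
            tp_flags.append(0)
    # Precision and recall after each prediction.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # All-point interpolation: make precision non-increasing in recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy example: two ground-truth boxes, three predictions (one is a false positive).
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (50, 50, 60, 60)),
    (0.7, (21, 21, 30, 30)),
]
ap50 = average_precision_50(preds, gts)
print(round(ap50, 4))
```

In this toy case the second-ranked prediction overlaps nothing, so precision dips before the third prediction recovers the remaining ground-truth box; the interpolation step is what keeps that dip from being counted twice.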

Problem

Research questions and friction points this paper is trying to address.

high-throughput plant phenotyping

phenotyping bottleneck

open-set crop analysis

species-specific annotation

zero-shot generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model

Domain-Specific Semantic Alignment

Open-Set Crop Detection