🤖 AI Summary
This work addresses the limited generalization of existing deep learning models in heterogeneous agricultural settings, which hinders robust fine-grained crop–weed segmentation. To overcome this challenge, the authors propose VL-WS, a novel framework that, for the first time, integrates vision–language semantic alignment into multi-domain weed segmentation in farmland. The approach leverages frozen CLIP image–text embeddings fused with spatial features and introduces a FiLM layer conditioned on natural language prompts to enable channel-wise feature modulation. Evaluated on four benchmark datasets, VL-WS achieves an average Dice score of 91.64%, outperforming CNN baselines by 4.98%, and attains a Dice score of 80.45% on the most challenging weed classes—an improvement of 15.42%. The method substantially enhances model generalization and data efficiency across diverse imaging conditions, crop species, and growth stages.
📝 Abstract
Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. However, existing deep learning models struggle to generalize across heterogeneous agricultural environments due to their reliance on dataset-specific visual features. We propose Vision-Language Weed Segmentation (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations. Our architecture employs a dual-encoder design, where frozen Contrastive Language-Image Pretraining (CLIP) embeddings and task-specific spatial features are fused and modulated via Feature-wise Linear Modulation (FiLM) layers conditioned on natural language captions. This design enables image-level textual descriptions to guide channel-wise feature refinement while preserving fine-grained spatial localization. Unlike prior work restricted to training and evaluation on single-source datasets, VL-WS is trained on a unified corpus that includes close-range ground imagery (robotic platforms) and high-altitude UAV imagery, covering diverse crop types, weed species, growth stages, and sensing conditions. Experimental results across four benchmark datasets demonstrate the effectiveness of our framework, with VL-WS achieving a mean Dice score of 91.64% and outperforming the CNN baseline by 4.98%. The largest gains occur on the most challenging weed class, where VL-WS attains a Dice score of 80.45% compared to 65.03% for the best baseline, a 15.42% improvement. VL-WS further maintains stable weed segmentation performance under limited target-domain supervision, indicating improved generalization and data efficiency. These findings highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.
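The FiLM conditioning described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, weight initialization, and names (`film_modulate`, `W_gamma`, `W_beta`) are assumptions, random vectors stand in for the frozen CLIP caption embedding and the spatial encoder's features, and plain NumPy replaces trained projection layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: a 512-d text embedding (typical for CLIP) conditioning
# a 64-channel spatial feature map at 32x32 resolution.
TEXT_DIM, CHANNELS, H, W = 512, 64, 32, 32

# Stand-ins for learned FiLM projection weights; in a real model these
# would be trained linear layers, not fixed random matrices.
W_gamma = rng.standard_normal((TEXT_DIM, CHANNELS)) * 0.01
b_gamma = np.ones(CHANNELS)   # gamma initialized near 1 (near-identity)
W_beta = rng.standard_normal((TEXT_DIM, CHANNELS)) * 0.01
b_beta = np.zeros(CHANNELS)   # beta initialized at 0

def film_modulate(feat, text_emb):
    """Channel-wise FiLM: out[c] = gamma[c] * feat[c] + beta[c],
    where gamma and beta are predicted from the text embedding."""
    gamma = text_emb @ W_gamma + b_gamma   # shape (CHANNELS,)
    beta = text_emb @ W_beta + b_beta      # shape (CHANNELS,)
    # Broadcast per-channel scale and shift over the spatial grid.
    return gamma[:, None, None] * feat + beta[:, None, None]

# Placeholder inputs for the caption embedding and spatial features.
text_emb = rng.standard_normal(TEXT_DIM)
feat = rng.standard_normal((CHANNELS, H, W))

out = film_modulate(feat, text_emb)
print(out.shape)  # (64, 32, 32)
```

The key property this sketch shows is that the text prompt controls only a per-channel affine transform (one scale and one shift per channel), so language can re-weight feature channels without disturbing the spatial layout that pixel-level segmentation depends on.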