🤖 AI Summary
This study addresses the degradation of pixel-level weed segmentation in multispectral UAV imagery of barley fields under cross-field, cross-season, and illumination variation. To tackle this challenge, the authors propose VISA, a dual-stream segmentation network that, for the first time, decouples radiometrically calibrated five-band reflectance from vegetation indices and fuses the two streams at native resolution. The architecture integrates residual spectral-spatial attention, windowed self-attention, state-space layers, and Slot Attention to improve robustness when detecting sparse weeds within dense canopies. Evaluated on the newly curated four-year BAWSeg dataset, which comprises radiometrically calibrated multispectral orthoimagery with dense annotations, the model achieves a weed IoU of 63.5% and an mIoU of 75.6% with 22.8M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Notably, it maintains strong generalization, with cross-plot and cross-year mIoU scores of 71.2% and 69.2%, respectively.
📝 Abstract
Accurate weed mapping in cereal fields requires pixel-level segmentation from UAV imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop–weed pixels, or on single-stream CNN and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopies. We propose VISA (Vegetation-Index and Spectral Attention), a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using residual spectral-spatial attention to preserve fine textures and row boundaries that are attenuated by ratio indices. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing. To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red-edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and "other" labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mIoU and 63.5% weed IoU with 22.8M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively. The BAWSeg data, VISA code, and trained models will be released upon publication.
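To make the index stream's input concrete, the sketch below derives vegetation-index maps from a calibrated five-band reflectance tile. The abstract does not specify which indices BAWSeg provides, so NDVI, NDRE, and GNDVI are used here purely as illustrative assumptions; the band ordering and the `vegetation_indices` helper are likewise hypothetical.

```python
# Hedged sketch (not the authors' code): computing vegetation-index maps
# from radiometrically calibrated five-band reflectance. The choice of
# indices (NDVI, NDRE, GNDVI) and the band order are assumptions.
import numpy as np

def vegetation_indices(reflectance: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """reflectance: (5, H, W) array ordered [blue, green, red, red_edge, nir],
    values in [0, 1]. Returns a (3, H, W) stack of index maps at native resolution."""
    blue, green, red, red_edge, nir = reflectance
    ndvi = (nir - red) / (nir + red + eps)              # normalized difference vegetation index
    ndre = (nir - red_edge) / (nir + red_edge + eps)    # red-edge variant, chlorophyll-sensitive
    gndvi = (nir - green) / (nir + green + eps)         # green NDVI
    return np.stack([ndvi, ndre, gndvi])

# Tiny synthetic tile: 5 bands, 4x4 pixels of plausible reflectance values.
refl = np.random.default_rng(0).uniform(0.05, 0.6, size=(5, 4, 4))
idx_maps = vegetation_indices(refl)
print(idx_maps.shape)  # (3, 4, 4)
```

Because each index is a normalized ratio, the maps are bounded in [-1, 1] and partially invariant to illumination, which is why the paper keeps them in a separate stream from the raw reflectance textures.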