AI Summary
This work addresses the challenge that general-purpose vision foundation models struggle to handle the diversity and complexity of crop images in agricultural settings due to significant domain gaps. To overcome this limitation, the authors propose SPROUT, the first agricultural vision foundation model built on a pixel-level diffusion Transformer architecture without a variational autoencoder (VAE). By leveraging a multi-crop strategy and denoising self-supervised pretraining, SPROUT enables end-to-end learning on 2.6 million agricultural images, capturing rich, structure-aware representations. The resulting unified multi-scale, multi-task framework substantially outperforms both general-purpose and agriculture-specific models while significantly reducing the computational cost of pretraining.
Abstract
Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce SPROUT (Scalable Plant Representation model via Open-field Unsupervised Training), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free pixel-space diffusion Transformer to learn rich, structure-aware representations through denoising while enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
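The core pretraining signal described above, denoising diffusion in pixel space, can be sketched in simplified form. The snippet below is an illustrative toy (pure Python, hypothetical function names, a flattened 16-pixel "image"), not SPROUT's actual Transformer: it shows the standard forward-noising step and the noise-prediction loss that such pretraining minimizes.

```python
import math
import random

def noise_image(x0, alpha_bar, rng):
    """Forward diffusion: blend a clean image with Gaussian noise.

    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * e
          for p, e in zip(x0, eps)]
    return xt, eps

def denoising_loss(eps_pred, eps):
    """Mean-squared error between predicted and true noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
x0 = [0.5] * 16                          # flattened toy "image"
xt, eps = noise_image(x0, alpha_bar=0.7, rng=rng)
loss = denoising_loss([0.0] * 16, eps)   # a trivial constant predictor
```

In a real pixel-space model, `eps_pred` would come from the Transformer applied directly to the noisy pixels `xt` (no VAE encoding step), and minimizing this loss over many images and noise levels is what yields the structure-aware representations used downstream.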