AI Summary
This work addresses the challenge that general-purpose vision foundation models struggle to handle the diversity and complexity of crop images in agricultural settings due to significant domain gaps. To overcome this limitation, the authors propose SPROUT, the first agricultural vision foundation model built on a pixel-level diffusion Transformer architecture without a variational autoencoder (VAE). By leveraging a multi-crop strategy and denoising self-supervised pretraining, SPROUT enables end-to-end learning on 2.6 million agricultural images, capturing rich, structure-aware representations. The resulting unified multi-scale, multi-task framework substantially outperforms both general-purpose and agriculture-specific models while significantly reducing the computational cost of pretraining.
Abstract
Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce SPROUT (Scalable Plant Representation model via Open-field Unsupervised Training), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free pixel-space diffusion Transformer to learn rich, structure-aware representations through denoising while enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
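The core pretraining signal described above, denoising diffusion in pixel space, can be sketched in simplified form. The snippet below is an illustrative toy (pure Python, hypothetical function names, a flattened 16-pixel "image"), not SPROUT's actual Transformer: it shows the standard forward-noising step and the noise-prediction loss that such pretraining minimizes.

```python
import math
import random

def noise_image(x0, alpha_bar, rng):
    """Forward diffusion: blend a clean image with Gaussian noise.

    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * e
          for p, e in zip(x0, eps)]
    return xt, eps

def denoising_loss(eps_pred, eps):
    """Mean-squared error between predicted and true noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
x0 = [0.5] * 16                          # flattened toy "image"
xt, eps = noise_image(x0, alpha_bar=0.7, rng=rng)
loss = denoising_loss([0.0] * 16, eps)   # a trivial constant predictor
```

In a real pixel-space model, `eps_pred` would come from the Transformer applied directly to the noisy pixels `xt` (no VAE encoding step), and minimizing this loss over many images and noise levels is what yields the structure-aware representations used downstream.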