SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision

πŸ“… 2026-03-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge that general-purpose vision foundation models struggle to effectively handle the diversity and complexity of crop images in agricultural settings due to significant domain gaps. To overcome this limitation, the authors propose SPROUTβ€”the first agricultural vision foundation model based on a pixel-level diffusion Transformer architecture without a variational autoencoder (VAE). By leveraging a multi-crop strategy and denoising self-supervised pretraining, SPROUT enables end-to-end learning on 2.6 million agricultural images, capturing rich, structure-aware representations. The method establishes a unified multi-scale, multi-task framework that substantially outperforms both existing general-purpose and agriculture-specific models while significantly reducing the computational cost of pretraining.
πŸ“ Abstract
Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce SPROUT (Scalable Plant Representation model via Open-field Unsupervised Training), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising while enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
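The denoising pre-training objective described above can be sketched in a few lines. This is a minimal illustration of standard epsilon-prediction diffusion training on raw pixels (no VAE encoder in the loop), not the paper's exact formulation; the noise schedule values and the placeholder denoiser are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule over T steps (not the paper's schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def add_noise(x, t, alpha_bar):
    """Forward diffusion on raw pixels:
    x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps
    """
    eps = rng.standard_normal(x.shape)
    x_t = np.sqrt(alpha_bar[t]) * x + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# A random 3-channel array stands in for one agricultural image.
x = rng.standard_normal((3, 32, 32))
x_t, eps = add_noise(x, t=500, alpha_bar=alpha_bar)

# Pre-training target: the Diffusion Transformer receives (x_t, t) and
# predicts eps; its intermediate features become the learned representation.
pred_eps = np.zeros_like(eps)          # stand-in for the network's output
loss = np.mean((pred_eps - eps) ** 2)  # epsilon-prediction MSE
```

Because the loss operates directly in pixel space, gradients reach the Transformer end-to-end without a frozen VAE, which is the efficiency point the summary highlights.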
Problem

Research questions and friction points this paper is trying to address.

Vision Foundation Models
domain gap
agricultural vision
representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
VAE-free
Agricultural Foundation Model
Unsupervised Pre-training
Multi-crop Multi-task
Shuai Xiang
Graduate School of Agricultural and Life Sciences, The University of Tokyo
Wei Guo
Graduate School of Agricultural and Life Sciences, The University of Tokyo
James Burridge
Graduate School of Agricultural and Life Sciences, The University of Tokyo
Shouyang Liu
Professor, Nanjing Agricultural University
Phenotyping · Crop modeling · Remote sensing in agriculture
Hao Lu
Associate Professor, Huazhong University of Science and Technology
Computer Vision · Deep Learning · Plant Phenotyping
Tokihiro Fukatsu
Institute of Agricultural Machinery, NARO; Institute of Life and Environmental Sciences, University of Tsukuba